Abstract

We analyze convergence rates of stochastic optimization procedures for non-smooth convex
optimization problems. By combining randomized smoothing techniques with accelerated
gradient methods, we obtain convergence rates of stochastic optimization procedures, both in
expectation and with high probability, that have optimal dependence on the variance of the
gradient estimates. To the best of our knowledge, these are the first variance-based rates for
non-smooth optimization. We give several applications of our results to statistical estimation
problems, and provide experimental results that demonstrate the effectiveness of the proposed
algorithms. We also describe how a combination of our algorithm with recent work on
decentralized optimization yields a distributed stochastic optimization algorithm that is
order-optimal.

1 Introduction

In this paper, we develop and analyze randomized smoothing procedures for solving the following
class of stochastic optimization problems. Let $\{F(\cdot;\xi), \xi \in \Xi\}$ be a collection of
real-valued functions, each with domain containing the closed convex set
$\mathcal{X} \subseteq \mathbb{R}^d$. Letting $P$ be a probability distribution over the index set
$\Xi$, consider the function $f : \mathcal{X} \to \mathbb{R}$ defined via

\[ f(x) := \mathbb{E}[F(x;\xi)] = \int_{\Xi} F(x;\xi)\, dP(\xi). \tag{1} \]

More precisely, we analyze a family of randomized smoothing procedures for solving potentially
non-smooth stochastic optimization problems of the form

\[ \min_{x \in \mathcal{X}} \{ f(x) + \varphi(x) \}, \tag{2} \]

where $\varphi : \mathcal{X} \to \mathbb{R}$ is a known regularizing function. Throughout the
paper, we assume that $f$ is convex on its domain $\mathcal{X}$. This condition is satisfied, for
instance, if the function $F(\cdot;\xi)$ is convex for $P$-almost every $\xi$. We assume that
$\varphi$ is closed and convex, but we allow for non-differentiability so that the framework
includes the $\ell_1$-norm and related regularizers.

While we will later discuss the effects that $\varphi(x)$ has on our optimization procedures,
throughout we will mostly consider the properties of the stochastic function $f$. Problem (2) is
challenging mainly for two reasons. First, the function $f$ may be non-smooth. Second, in many
cases, $f$ cannot actually be evaluated. When $\xi$ is high-dimensional, the integral (1) cannot be
efficiently computed, and in statistical learning problems we usually do not even know what the
distribution $P$ is. Thus, throughout this work, we assume only that we have access to a stochastic
oracle that allows us to get i.i.d. samples $\xi \sim P$, and consequently we focus on stochastic
gradient procedures for the convex program (2).

To address the first difficulty mentioned above, namely that $f$ may be non-smooth, several
researchers have considered techniques for smoothing the objective. Such approaches for
deterministic non-smooth problems are by now well known, and include Moreau-Yosida regularization
(e.g., [22]), methods based on recession functions [3], and a method that uses conjugate and
proximal functions [26]. Several works study methods to replace constraints $f(x) \le 0$ in convex
programming problems with exact penalties $\max\{0, f(x)\}$ in the objective, after which smoothing
is applied to the $\max\{0, \cdot\}$ operator (e.g., see the paper [8] and references therein). The
difficulty of such approaches is that most require quite detailed knowledge of the structure of the
function $f$ to be minimized and hence are impractical in stochastic settings.

The second difficulty of solving the convex program (2) is that the function cannot actually
be evaluated except through stochastic realizations of $f$ and its (sub)gradients. In this paper,
we develop an algorithm for solving problem (2) based on stochastic subgradient methods. Although
such methods are classical [30, 11, 28], recent work by Juditsky et al. [15] and Lan [18, 19] has
shown that if $f$ is smooth (that is, its gradients are Lipschitz continuous), convergence rates
dependent on the variance of the stochastic gradient estimator are achievable. Specifically, if
$\sigma^2$ is the variance of the gradient estimator, the convergence rate of the resulting
stochastic optimization procedure is $O(\sigma/\sqrt{T})$. Of particular relevance to our study is
the following fact: if the oracle (instead of returning just a single estimate) returns $m$
unbiased estimates of the gradient, the variance of the gradient estimator is reduced by a factor
of $m$. Dekel et al. [9] exploit this fact to develop asymptotically order-optimal distributed
optimization algorithms, as we discuss in the sequel.
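
The factor-of-$m$ variance reduction can be checked numerically. The following is a minimal sketch, not part of the original analysis: the Gaussian noise model, the vector `true_grad`, and the function name `empirical_variance` are assumptions chosen purely for illustration.

```python
import numpy as np

def empirical_variance(m, trials=20000, seed=0):
    """Monte Carlo estimate of E||g_bar - true_grad||^2, where g_bar is the
    average of m i.i.d. unbiased estimates with unit-variance Gaussian noise
    (an assumed noise model)."""
    rng = np.random.default_rng(seed)
    true_grad = np.array([1.0, -2.0, 0.5])  # hypothetical gradient
    noise = rng.normal(0.0, 1.0, size=(trials, m, true_grad.size))
    g_bar = (true_grad + noise).mean(axis=1)  # average over the m estimates
    return float(np.mean(np.sum((g_bar - true_grad) ** 2, axis=1)))

var_1 = empirical_variance(1)   # single estimate per query
var_8 = empirical_variance(8)   # average of m = 8 estimates
ratio = var_1 / var_8           # expected to be close to m = 8
```

Under these assumptions the ratio `var_1 / var_8` should be close to 8, matching the claim that averaging $m$ unbiased estimates divides the variance by $m$.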

To the best of our knowledge, there is no prior work on non-smooth stochastic problems in which a
reduction in the variance of the stochastic estimate of the true subgradient yields an improvement
in convergence rates. For non-smooth stochastic optimization, known convergence rates depend
only on the Lipschitz constant of the functions $F(\cdot;\xi)$ and the number of actual updates
performed. Within the oracle model of convex optimization [25], the optimizer has access to a
black-box oracle that, given a point $x \in \mathcal{X}$, returns an unbiased estimate of a
(sub)gradient of the objective $f$ at the point $x$. In most stochastic optimization procedures, an
algorithm updates a parameter $x_t$ at every iteration by querying the oracle for one stochastic
subgradient; we consider the natural extension to the case when the optimizer issues several
queries to the stochastic oracle at every iteration.

A convolution-based smoothing technique amenable to non-smooth stochastic optimization
problems is the starting point for our approach. A number of authors (e.g., [16, 32, 17, 38]) have
noted that particular random perturbations of the variable $x$ transform $f$ into a smooth function.
The intuition underlying such approaches is that convolving two functions yields a new function
that is at least as smooth as the smoothest of the two original functions. In particular, let $\mu$
denote the density of a random variable with respect to Lebesgue measure, and consider the smoothed
objective function

\[ f_\mu(x) := \int_{\mathbb{R}^d} f(x + y)\,\mu(y)\,dy = \mathbb{E}_\mu[f(x + Z)], \tag{3} \]

where $Z$ is a random variable with probability density $\mu$. Clearly, $f_\mu$ is convex whenever
$f$ is convex; moreover, it is known that if $\mu$ is a density with respect to Lebesgue measure,
then $f_\mu$ is differentiable [4].
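
As a one-dimensional illustration of this smoothing effect (a sketch under stated assumptions, not taken from the original text), take $f(x) = |x|$ and let $Z$ be uniform on $[-1, 1]$ with scale $u$. A direct calculation gives $\mathbb{E}|x + uZ| = (x^2 + u^2)/(2u)$ for $|x| \le u$ and $|x|$ otherwise, which is differentiable at the origin. The function names below are hypothetical.

```python
import numpy as np

def smoothed_abs_closed_form(x, u):
    """E|x + u Z| for Z uniform on [-1, 1]: quadratic near 0, |x| elsewhere."""
    return np.where(np.abs(x) <= u, (x ** 2 + u ** 2) / (2 * u), np.abs(x))

def smoothed_abs_monte_carlo(x, u, n_samples=200000, seed=0):
    """Estimate the same expectation by sampling the perturbation Z."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(-1.0, 1.0, size=n_samples)
    return float(np.mean(np.abs(x + u * z)))

u = 0.5
exact_at_zero = float(smoothed_abs_closed_form(0.0, u))  # equals u / 2
mc_at_zero = smoothed_abs_monte_carlo(0.0, u)
```

In this example the smoothed function exceeds $f$ by at most $u/2$ and has a gradient $x/u$ near the origin, so a smaller $u$ gives a closer but less smooth approximation.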

We analyze minimization procedures that solve the non-smooth problem (2) by using stochastic
gradient samples from the smoothed function (3) with appropriate choice of smoothing density
$\mu$. The main contribution of our paper is to show that the ability to issue several queries to
the stochastic oracle for the original objective (2) can give faster rates of convergence than a
simple stochastic oracle. Our two main theorems quantify the above statement in terms of expected
values (Theorem 1) and, under an additional reasonable tail condition, with high probability
(Theorem 2). One consequence of our results is that a procedure that queries the non-smooth
stochastic oracle for $m$ subgradients at iteration $t$ achieves rate of convergence
$O(RL_0/\sqrt{Tm})$ in expectation and with high probability. (Here $L_0$ is the Lipschitz constant
of the function $f$ and $R$ is the $\ell_2$-radius of its domain.) As we discuss in Section 2.4,
this convergence rate is optimal up to constant factors. Moreover, this fast rate of convergence
has implications for applications in statistical problems, distributed optimization, and other
areas, as discussed in Section 3.

The remainder of the paper is organized as follows. In the next section, we review standard
techniques for stochastic optimization, noting a few of their deficiencies. After this, we state our
algorithm and main theorems achieving faster rates of convergence for non-smooth stochastic
problems using the randomized smoothing technique (3). We make strong use of the fine analytic
properties of randomized smoothing, and collect several relevant results in Appendix E. In
Section 3.1, we outline several applications of the smoothing techniques, which we complement in
Section 3.2 with experiments and simulations showing the merits of our new approach. Section 4
contains proofs of our main results, though we defer more technical aspects to the appendices.

Notation: For the reader's convenience, here we specify notation as well as a few definitions.
We use $B_p(x, u) = \{y \in \mathbb{R}^d \mid \|x - y\|_p \le u\}$ to denote the closed $p$-norm
ball of radius $u$ around the point $x$. Addition of sets $A$ and $B$ is defined as the Minkowski
sum in $\mathbb{R}^d$, that is, $A + B = \{x \in \mathbb{R}^d \mid x = y + z,\ y \in A,\ z \in B\}$,
and multiplication of a set $A$ by a scalar $\alpha$ is defined to be
$\alpha A := \{\alpha x \mid x \in A\}$. For any function or distribution $\mu$, we let
$\operatorname{supp} \mu := \{x \mid \mu(x) \ne 0\}$ denote its support. Given a convex function $f$
with domain $\mathcal{X}$, for any $x \in \mathcal{X}$, we use $\partial f(x)$ to denote its
subdifferential. We define the shorthand notation
$\|\partial f(x)\| = \sup\{\|g\| \mid g \in \partial f(x)\}$ for any norm $\|\cdot\|$. The dual norm
$\|\cdot\|_*$ associated with the norm $\|\cdot\|$ is defined as
$\|z\|_* := \sup_{\|x\| \le 1} \langle z, x \rangle$. A function $f$ is $L_0$-Lipschitz with respect
to the norm $\|\cdot\|$ over $\mathcal{X}$ if

\[ |f(x) - f(y)| \le L_0 \|x - y\| \]

for all $x, y \in \mathcal{X}$. For convex $f$, it is known [12] that $f$ is $L_0$-Lipschitz in this
sense if and only if $\sup_{x \in \mathcal{X}} \|\partial f(x)\|_* \le L_0$. We say the gradient of
$f$ is $L_1$-Lipschitz continuous with respect to the norm $\|\cdot\|$ over $\mathcal{X}$ if

\[ \|\nabla f(x) - \nabla f(y)\|_* \le L_1 \|x - y\| \quad \text{for } x, y \in \mathcal{X}. \]

A function $\psi$ is strongly convex with respect to a norm $\|\cdot\|$ over $\mathcal{X}$ if for
all $x, y \in \mathcal{X}$,

\[ \psi(y) \ge \psi(x) + \langle \nabla\psi(x), y - x \rangle + \frac{1}{2}\|x - y\|^2. \]

Given a convex and differentiable function $\psi$, the associated Bregman divergence [5] is given by
$D_\psi(x, y) := \psi(x) - \psi(y) - \langle \nabla\psi(y), x - y \rangle$. When
$X \in \mathbb{R}^{d_1 \times d_2}$ is a matrix, we let $\rho_i(X)$ denote its $i$th largest
singular value, and when $X \in \mathbb{R}^{d \times d}$, we let $\lambda_i(X)$ denote its $i$th
largest eigenvalue by modulus. The transpose of $X$ is denoted $X^\top$. The notation
$\xi \sim P$ indicates that $\xi$ is drawn according to the distribution $P$.

2 Main results and some consequences

In this section, we begin by motivating the algorithm studied in this paper, and then state our
main results on its convergence behavior.

2.1 Some background

We focus on stochastic gradient descent methods¹ based on dual averaging schemes [27] for solving
the stochastic problem (2). Dual averaging methods are based on a proximal function $\psi$, which
is assumed strongly convex with respect to a norm $\|\cdot\|$. The update scheme of such a method is
as follows. Given a point $x_t \in \mathcal{X}$, the algorithm queries a stochastic oracle and
receives a random vector $g_t$ such that $\mathbb{E}[g_t] \in \partial f(x_t)$. The algorithm then
performs the update

\[ x_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \sum_{\tau=0}^{t} \langle g_\tau, x \rangle + \frac{1}{\alpha_t}\psi(x) \Big\}, \tag{4} \]




where $\alpha_t > 0$ is a sequence of stepsizes. Under some mild assumptions, the algorithm is
guaranteed to converge for stochastic problems. For instance, suppose that $\psi$ is strongly
convex with respect to the norm $\|\cdot\|$, and moreover that
$\mathbb{E}[\|g_t\|_*^2] \le L_0^2$ for all $t$, where we recall that $\|\cdot\|_*$ denotes the dual
norm to $\|\cdot\|$. Then, with stepsizes $\alpha_t \propto R/(L_0\sqrt{t})$, it is known that the
sequence $\{x_t\}_{t=0}^{\infty}$ generated by the updates (4) satisfies

\[ \mathbb{E}\Big[ f\Big( \frac{1}{T}\sum_{t=1}^{T} x_t \Big) \Big] - f(x^*) = O\Big( \frac{L_0\sqrt{\psi(x^*)}}{\sqrt{T}} \Big). \tag{5} \]




We refer the reader to papers by Nesterov [27] and Xiao [36] for results of this type. 
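
For concreteness, the update (4) can be sketched in the Euclidean case. This is a minimal illustration under assumptions that are not in the original text: $\psi(x) = \frac{1}{2}\|x\|_2^2$, $\mathcal{X}$ an $\ell_2$-ball of radius `radius`, stepsize $\alpha_t = R/(L_0\sqrt{t+1})$ (shifted by one to avoid division by zero), and a toy objective $f(x) = \|x - c\|_1$ with additive Gaussian noise on the subgradient. With this $\psi$, the argmin in (4) is the projection of $-\alpha_t \sum_\tau g_\tau$ onto the ball.

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def dual_averaging(oracle, dim, radius, lip0, n_iters, rng):
    """Dual averaging (4) with psi(x) = 0.5 * ||x||_2^2; returns the averaged iterate."""
    x = np.zeros(dim)
    grad_sum = np.zeros(dim)
    x_avg = np.zeros(dim)
    for t in range(n_iters):
        grad_sum += oracle(x, rng)                       # accumulate g_0, ..., g_t
        alpha = radius / (lip0 * np.sqrt(t + 1.0))       # assumed stepsize schedule
        x = project_l2_ball(-alpha * grad_sum, radius)   # closed-form argmin of (4)
        x_avg += (x - x_avg) / (t + 1.0)                 # running average of iterates
    return x_avg

# Toy problem (an assumption): f(x) = ||x - c||_1 with noisy subgradients.
c = np.array([0.3, -0.2, 0.1])

def noisy_oracle(x, rng):
    return np.sign(x - c) + 0.5 * rng.normal(size=x.size)

rng = np.random.default_rng(0)
x_bar = dual_averaging(noisy_oracle, dim=3, radius=1.0, lip0=2.0, n_iters=5000, rng=rng)
gap = float(np.sum(np.abs(x_bar - c)))  # f(x_bar) - f(x*), since f(x*) = 0
```

Note that only one stochastic subgradient enters each update here, which is exactly the setting in which the bound (5) ignores the variance of $g_t$.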

An unsatisfying aspect of the bound (5) is the absence of any role for the variance of the
(sub)gradient estimator $g_t$. In particular, even if an algorithm is able to obtain $m > 1$
samples of the gradient of $f$ at $x_t$, thereby giving a significantly more accurate gradient
estimate, this result fails to capture the likely improvement of the method. We address this
problem by stochastically smoothing the non-smooth objective $f$ and then adapting recent work on
so-called "accelerated" gradient methods [19, 34, 36] to achieve variance-based improvements.
Accelerated methods work only when the function $f$ is smooth, that is, when it has Lipschitz
continuous gradients. Thus, we turn now to developing the tools necessary to stochastically smooth
the non-smooth objective (2).



2.2 Description of algorithm 

Our algorithm is based on observations of stochastically perturbed gradient information at each
iteration, where we slowly decrease the perturbation as the algorithm proceeds. More precisely,
our algorithm uses the following scheme. Let $\{u_t\} \subset \mathbb{R}_+$ be a non-increasing
sequence of positive real numbers; these quantities control the perturbation size. At iteration
$t$, rather than query the stochastic oracle at the point $y_t$, the algorithm queries the oracle
at $m$ points drawn randomly from some neighborhood around $y_t$. Specifically, it performs the
following three steps:

(1) Draws random variables $\{Z_{i,t}\}_{i=1}^{m}$ in an i.i.d. manner according to the
distribution $\mu$.

(2) Queries the oracle at the $m$ points $y_t + u_t Z_{i,t}$, $i = 1, 2, \ldots, m$, yielding
stochastic gradients

\[ g_{i,t} \in \partial F(y_t + u_t Z_{i,t}; \xi_{i,t}), \quad \text{where } \xi_{i,t} \sim P, \text{ for } i = 1, 2, \ldots, m. \tag{6} \]

(3) Computes the average $g_t = \frac{1}{m}\sum_{i=1}^{m} g_{i,t}$.

¹ We note in passing that essentially identical results can also be obtained for methods based on
mirror descent [25, 34], though we omit these so as not to overburden the reader.

Here and throughout we denote the distribution of the random variable $u_t Z$ by $\mu_t$, and we
note that this procedure ensures
$\mathbb{E}[g_t \mid y_t] = \nabla f_{\mu_t}(y_t) = \nabla \mathbb{E}[F(y_t + u_t Z; \xi) \mid y_t]$,
where $f_{\mu_t}$ is the smoothed function (3) and $\mu_t$ is the density of $u_t Z$.
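
The three sampling steps can be sketched as follows. This is an illustrative sketch only: it assumes $\mu$ is uniform on the unit $\ell_2$-ball, and the names `sample_uniform_l2_ball`, `smoothed_gradient`, and the oracle signature `oracle(point, rng)` are hypothetical.

```python
import numpy as np

def sample_uniform_l2_ball(rng, dim):
    """Step (1): draw Z uniformly from the unit l2 ball B_2(0, 1)."""
    direction = rng.normal(size=dim)
    direction /= np.linalg.norm(direction)       # uniform direction on the sphere
    return direction * rng.uniform() ** (1.0 / dim)  # radius with density ~ r^(dim-1)

def smoothed_gradient(y, u, m, oracle, rng):
    """Steps (2)-(3): query the oracle at y + u * Z_i for i = 1..m and average."""
    grads = [oracle(y + u * sample_uniform_l2_ball(rng, y.size), rng)
             for _ in range(m)]
    return np.mean(grads, axis=0)

# Example with f(x) = ||x||_1, whose subgradient is sign(x) (deterministic oracle).
l1_oracle = lambda x, rng: np.sign(x)
rng = np.random.default_rng(0)
g_at_kink = smoothed_gradient(np.zeros(3), 1.0, 4000, l1_oracle, rng)
g_far = smoothed_gradient(np.array([5.0, -5.0, 5.0]), 1.0, 10, l1_oracle, rng)
```

At the kink $y = 0$ the averaged subgradient is close to zero, the gradient of the smoothed function there, while far from the kink the perturbation has no effect.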

By combining the sampling scheme (6) with extensions of Tseng's recent work on accelerated
gradient methods [34], we can achieve stronger convergence rates for solving the non-smooth
objective (2). The update we propose is essentially a smoothed version of the simpler method (4).
The method uses three series of points, denoted $\{x_t, y_t, z_t\} \in \mathcal{X}^3$. We use $y_t$
as a "query point", so that at iteration $t$, the algorithm receives a vector $g_t$ as described in
the sampling scheme (6). The three sequences evolve according to a dual-averaging algorithm, which
in our case involves three scalars $(L_t, \theta_t, \eta_t)$ to control step sizes. The recursions
are as follows:

\[ y_t = (1 - \theta_t)x_t + \theta_t z_t \tag{7a} \]
\[ z_{t+1} = \operatorname*{argmin}_{x \in \mathcal{X}} \Big\{ \sum_{\tau=0}^{t} \frac{1}{\theta_\tau}\langle g_\tau, x \rangle + \sum_{\tau=0}^{t} \frac{1}{\theta_\tau}\varphi(x) + L_{t+1}\psi(x) + \frac{\eta_{t+1}}{\theta_{t+1}}\psi(x) \Big\} \tag{7b} \]
\[ x_{t+1} = (1 - \theta_t)x_t + \theta_t z_{t+1}. \tag{7c} \]

In prior work on accelerated schemes for stochastic and non-stochastic optimization [34, 19, 36],
the term $L_t$ is set equal to the Lipschitz constant of $\nabla f$; in contrast, our choice of
varying $L_t$ allows our smoothing schemes to be oblivious to the number of iterations $T$. The
extra damping term $\eta_t/\theta_t$ provides control over the fluctuations induced by using the
random vector $g_t$ as opposed to deterministic subgradient information. As in Tseng's work [34],
we assume that $\theta_0 = 1$ and $(1 - \theta_t)/\theta_t^2 = 1/\theta_{t-1}^2$; the latter
equality is ensured by setting $\theta_t = 2/\big(1 + \sqrt{1 + 4/\theta_{t-1}^2}\big)$.
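
The recursion for $\theta_t$ can be checked numerically. The sketch below (the function name `theta_sequence` is hypothetical) generates the sequence from $\theta_0 = 1$ and verifies the identity $(1 - \theta_t)/\theta_t^2 = 1/\theta_{t-1}^2$. It also checks the bound $\theta_t \le 2/(t + 2)$, a standard property of this recursion that follows by induction from $1/\theta_t \ge 1/2 + 1/\theta_{t-1}$ and is stated here as an added fact rather than one taken from the text above.

```python
import numpy as np

def theta_sequence(n):
    """Return theta_0, ..., theta_{n-1} with theta_0 = 1 and
    theta_t = 2 / (1 + sqrt(1 + 4 / theta_{t-1}^2))."""
    theta = [1.0]
    for _ in range(1, n):
        prev = theta[-1]
        theta.append(2.0 / (1.0 + np.sqrt(1.0 + 4.0 / prev ** 2)))
    return np.array(theta)

thetas = theta_sequence(50)
lhs = (1.0 - thetas[1:]) / thetas[1:] ** 2   # (1 - theta_t) / theta_t^2
rhs = 1.0 / thetas[:-1] ** 2                 # 1 / theta_{t-1}^2
upper = 2.0 / (np.arange(50) + 2.0)          # 2 / (t + 2)
```

The decay $\theta_t = O(1/t)$ is what makes the perturbation size $u_t = \theta_t u$ shrink as the algorithm proceeds.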



2.3 Convergence rates 

We now state our two main results on the convergence rate of the randomized smoothing
procedure (6) with accelerated dual averaging updates (7a)-(7c). So as to avoid cluttering the
theorem statements, we begin by stating our main assumptions and notation. When we state that a
function $f$ is Lipschitz continuous, we mean with respect to the norm $\|\cdot\|$, whose dual norm
we denote $\|\cdot\|_*$, and we assume that $\psi$ is nonnegative and strongly convex with respect
to $\|\cdot\|$. Our main assumption ensures that the smoothing operator and smoothed function are
relatively well-behaved.

Assumption A (Smoothing properties). The random variable $Z$ is zero-mean with density $\mu$ (with
respect to Lebesgue measure on the affine hull $\operatorname{aff}(\mathcal{X})$ of $\mathcal{X}$),
and there are constants $L_0$ and $L_1$ such that for all $u > 0$,
$\mathbb{E}[f(x + uZ)] \le f(x) + L_0 u$, and $\mathbb{E}[f(x + uZ)]$ has $\frac{L_1}{u}$-Lipschitz
continuous gradient with respect to the norm $\|\cdot\|$. For $P$-almost every $\xi \in \Xi$, we
have $\operatorname{dom} F(\cdot;\xi) \supseteq u_0 \operatorname{supp}\mu + \mathcal{X}$.

Let $\mu_t$ denote the density of the random vector $u_t Z$ and define the instantaneous smoothed
function $f_{\mu_t}(x) = \int f(x + z)\, d\mu_t(z)$. As discussed in the introduction, the function
$f_{\mu_t}$ is guaranteed to be smooth whenever $\mu$ (and hence $\mu_t$) is a density with respect
to Lebesgue measure, so Assumption A ensures that $f_{\mu_t}$ is uniformly close to $f$ and not too
"jagged." Many smoothing distributions, including Gaussians and uniform distributions on norm
balls, satisfy Assumption A (see Appendix E); we use such examples in the corollaries to follow.
The containment of $u_0 \operatorname{supp}\mu + \mathcal{X}$ in $\operatorname{dom} F(\cdot;\xi)$
guarantees that the subdifferential $\partial F(\cdot;\xi)$ is non-empty at all sampled points
$y_t + u_t Z$. Indeed, since $\mu$ is a density with respect to Lebesgue measure on
$\operatorname{aff}(\mathcal{X})$, with probability one
$y_t + u_t Z \in \operatorname{relint} \operatorname{dom} F(\cdot;\xi)$ and thus [12] the
subdifferential $\partial F(y_t + u_t Z; \xi) \ne \emptyset$.

In the algorithm (7a)-(7c), we set $L_t$ to be an upper bound on the Lipschitz constant of the
gradient of $\mathbb{E}[f(x + u_t Z)]$; this choice ensures good convergence properties of the
algorithm. The following is the first of our main theorems.

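Theorem 1. Define $u_t = \theta_t u$, use the scalar sequence $L_t = L_1/u_t$, and assume that
$\eta_t$ is non-decreasing. Under Assumption A, for any $x^* \in \mathcal{X}$ and $T \ge 4$,

\[ \mathbb{E}[f(x_T) + \varphi(x_T)] - [f(x^*) + \varphi(x^*)] \le \frac{6 L_1 \psi(x^*)}{Tu} + \frac{2\eta_T \psi(x^*)}{T} + \frac{1}{T}\sum_{t=0}^{T-1} \frac{1}{\eta_t}\mathbb{E}[\|e_t\|_*^2] + \frac{4 L_0 u}{T}, \tag{8} \]

where $e_t := \nabla f_{\mu_t}(y_t) - g_t$ is the error in the gradient estimate.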



Remarks: Note that the convergence rate (8) involves the variance $\mathbb{E}[\|e_t\|_*^2]$
explicitly. We exploit this fact in the corollaries to be stated shortly. In addition, note that
Theorem 1 does not require a priori knowledge of the number of iterations $T$ to be performed,
which renders it suitable to online and streaming applications. If such knowledge is available,
then it is possible to give a similar result using the smoothing parameter $u_t = u$ for all $t$;
such a result is stated as Theorem 3 in Section 4.
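
To show how the sampling scheme (6) and the recursions (7a)-(7c) fit together, here is a minimal end-to-end sketch. It is not the authors' implementation and makes several simplifying assumptions: $\varphi \equiv 0$, $\psi(x) = \frac{1}{2}\|x\|_2^2$, $\mathcal{X}$ an $\ell_2$-ball of radius $R$, $\mu$ uniform on the unit $\ell_2$-ball with $L_1 = L_0\sqrt{d}$, and the parameter choices $u = Rd^{1/4}$, $\eta_t = L_0\sqrt{t+1}/(R\sqrt{m})$ used in the Euclidean corollaries. Under these assumptions the argmin in (7b) reduces to a projection. All function names and the toy objective are hypothetical.

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto {x : ||x||_2 <= radius}."""
    norm = np.linalg.norm(v)
    return v if norm <= radius else v * (radius / norm)

def sample_l2_ball(rng, dim):
    """Draw Z uniformly from the unit l2 ball."""
    direction = rng.normal(size=dim)
    return direction / np.linalg.norm(direction) * rng.uniform() ** (1.0 / dim)

def next_theta(theta):
    """theta_t = 2 / (1 + sqrt(1 + 4 / theta_{t-1}^2))."""
    return 2.0 / (1.0 + np.sqrt(1.0 + 4.0 / theta ** 2))

def smoothed_accelerated_dual_averaging(oracle, dim, radius, lip0, m, n_iters, rng):
    u = radius * dim ** 0.25            # smoothing parameter u = R d^{1/4}
    lip1 = lip0 * np.sqrt(dim)          # L_1 for uniform l2-ball smoothing
    x = np.zeros(dim)
    z = np.zeros(dim)
    weighted_grad_sum = np.zeros(dim)   # sum over tau of g_tau / theta_tau
    theta = 1.0                         # theta_0 = 1
    for t in range(n_iters):
        y = (1.0 - theta) * x + theta * z                          # (7a)
        u_t = theta * u
        grads = [oracle(y + u_t * sample_l2_ball(rng, dim), rng)   # scheme (6)
                 for _ in range(m)]
        weighted_grad_sum += np.mean(grads, axis=0) / theta
        theta_next = next_theta(theta)
        lip_next = lip1 / (theta_next * u)                         # L_{t+1} = L_1 / u_{t+1}
        eta_next = lip0 * np.sqrt(t + 2.0) / (radius * np.sqrt(m)) # eta_{t+1}
        # (7b) with psi = 0.5 ||x||^2 and phi = 0: projection of a scaled sum.
        z = project_l2_ball(
            -weighted_grad_sum / (lip_next + eta_next / theta_next), radius)
        x = (1.0 - theta) * x + theta * z                          # (7c)
        theta = theta_next
    return x

# Toy problem (an assumption): f(x) = ||x - c||_1 with noisy subgradients.
c = np.array([0.3, -0.2, 0.1])

def noisy_oracle(x, rng):
    return np.sign(x - c) + 0.5 * rng.normal(size=x.size)

rng = np.random.default_rng(0)
x_T = smoothed_accelerated_dual_averaging(
    noisy_oracle, dim=3, radius=1.0, lip0=2.0, m=5, n_iters=2000, rng=rng)
final_gap = float(np.sum(np.abs(x_T - c)))  # f(x_T) - f(x*), since f(x*) = 0
```

The loop never uses the horizon `n_iters` inside an update, which reflects the remark that the method does not need to know $T$ in advance.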

The preceding result, which provides convergence in expectation, can be extended to bounds
that hold with high probability under suitable tail conditions on the error
$e_t := \nabla f_{\mu_t}(y_t) - g_t$. In particular, let $\mathcal{F}_t$ denote the $\sigma$-field
of the random variables $g_{i,s}$, $i = 1, \ldots, m$ and $s = 0, \ldots, t$. In order to achieve
high-probability convergence results, a subset of our results involve the following assumption.

Assumption B (Sub-Gaussian errors). The error is $(\|\cdot\|_*, \sigma)$ sub-Gaussian for some
$\sigma > 0$, meaning that with probability 1

\[ \mathbb{E}[\exp(\|e_t\|_*^2/\sigma^2) \mid \mathcal{F}_{t-1}] \le \exp(1) \quad \text{for all } t \in \{1, 2, 3, \ldots\}. \tag{9} \]

We refer the reader to Appendix F for more background on sub-Gaussian and sub-exponential 
random variables. In past work on smooth optimization, other authors [15, 19, 36] have imposed 
this type of tail assumption, and we discuss sufficient conditions for the assumption to hold in 
Corollary 4 in the following section. 

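Theorem 2. In addition to the conditions of Theorem 1, assume $\mathcal{X}$ is compact with
$\|x - x^*\| \le R$ for all $x \in \mathcal{X}$ and that Assumption B holds. Then with probability
at least $1 - 2\delta$, the algorithm with step size $\eta_t = \eta\sqrt{t+1}$ satisfies

\[ f(x_T) + \varphi(x_T) - [f(x^*) + \varphi(x^*)] \le \frac{6 L_1 \psi(x^*)}{Tu} + \frac{4 L_0 u}{T} + \frac{4\eta_T\psi(x^*)}{T} + \theta_{T-1}\sum_{t=0}^{T-1} \frac{1}{\eta_t}\mathbb{E}[\|e_t\|_*^2] + \frac{4\sigma^2 \max\{\log\frac{1}{\delta}, \sqrt{2e(\log T + 1)\log\frac{1}{\delta}}\}}{\eta T} + \frac{\sigma R\sqrt{\log\frac{1}{\delta}}}{\sqrt{T}}. \]
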

Remarks: The first four terms in the convergence rate given by Theorem 2 are essentially identical
to the expected rate of Theorem 1. The first of the additional terms decreases at a rate of $1/T$,
while the second decreases at a rate of $\sigma/\sqrt{T}$. As we discuss in the corollaries that
follow, the dependence $\sigma/\sqrt{T}$ on the variance $\sigma^2$ is optimal, and an appropriate
choice of the sequence $\eta_t$ in Theorem 1 yields the same rates to constant factors.



2.4 Some consequences 

The corollaries of the above theorems, and the consequential optimality guarantees of the
algorithm above, are our main focus for the remainder of this section. Specifically, we show
concrete convergence bounds for algorithms using different choices of the smoothing distribution
$\mu$. For each corollary, we make the assumption that $x^* \in \mathcal{X}$ satisfies
$\psi(x^*) \le R^2$, but is otherwise arbitrary, that the iteration number $T \ge 4$, and that
$u_t = u\theta_t$.

We begin with a corollary that provides bounds when the smoothing distribution $\mu$ is uniform
on the $\ell_2$-ball. The conditions on $F$ in the corollary hold, for example, when
$F(\cdot;\xi)$ is $L_0$-Lipschitz with respect to the $\ell_2$-norm for $P$-a.e. sample of $\xi$.

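Corollary 1. Let $\mu$ be uniform on $B_2(0, 1)$ and assume
$\mathbb{E}[\|\partial F(x;\xi)\|_2^2] \le L_0^2$ for $x \in \mathcal{X} + B_2(0, u)$, where we set
$u = Rd^{1/4}$. With the step size choices $\eta_t = L_0\sqrt{t+1}/(R\sqrt{m})$ and
$L_t = L_0\sqrt{d}/u_t$,

\[ \mathbb{E}[f(x_T) + \varphi(x_T)] - [f(x^*) + \varphi(x^*)] \le \frac{12 L_0 R d^{1/4}}{T} + \frac{5 L_0 R}{\sqrt{Tm}}. \]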




The following corollary shows that similar convergence rates are attained when smoothing with the 
normal distribution. 

Corollary 2. Let $\mu$ be the $d$-dimensional normal distribution with zero mean and identity
covariance $I$ and assume that $F(\cdot;\xi)$ is $L_0$-Lipschitz with respect to the
$\ell_2$-norm for $P$-a.e. $\xi$. With smoothing parameter $u = Rd^{-1/4}$ and step sizes
$\eta_t = L_0\sqrt{t+1}/(R\sqrt{m})$ and $L_t = L_0/u_t$, we have

\[ \mathbb{E}[f(x_T) + \varphi(x_T)] - [f(x^*) + \varphi(x^*)] \le \frac{10 L_0 R d^{1/4}}{T} + \frac{5 L_0 R}{\sqrt{Tm}}. \]



We remark here (deferring deeper discussion to Lemma 10) that the dimension dependence of
$d^{1/4}$ on the $1/T$ term in the previous corollaries cannot be improved by more than a constant
factor. Essentially, functions $f$ exist whose smoothed version $f_\mu$ cannot have both Lipschitz
continuous gradient and be uniformly close to $f$ in a dimension-independent sense, at least for
the uniform or normal distributions.

The advantage of using normal random variables, as opposed to $Z$ uniform on $B_2(0, u)$, is
that no normalization of $Z$ is necessary, though there are more stringent requirements on $f$. The
lack of normalization is a useful property in very high dimensional scenarios, such as statistical
natural language processing [23]. Similarly, we can sample $Z$ from an $\ell_\infty$ ball, which,
like $B_2(0, u)$, is still compact, but gives slightly looser bounds than sampling from
$B_2(0, u)$. Nonetheless, it is much easier to sample from $B_\infty(0, u)$ in high dimensional
settings, especially sparse data scenarios such as NLP where only a few coordinates of the random
variable $Z$ are needed.
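
The sparsity advantage of $\ell_\infty$-ball sampling can be illustrated as follows. Because the coordinates of $Z$ uniform on $B_\infty(0, 1)$ are independent, a quantity such as $\langle a, y + uZ \rangle$ with sparse $a$ requires sampling only the coordinates in the support of $a$. The sketch below is illustrative; the function name and the index/value representation of the sparse vector are assumptions.

```python
import numpy as np

def perturbed_inner_product(a_idx, a_val, y, u, rng):
    """Compute <a, y + u Z> for Z uniform on B_inf(0, 1), drawing only the
    coordinates of Z that lie in the support of the sparse vector a."""
    z_support = rng.uniform(-1.0, 1.0, size=len(a_idx))  # independent coordinates
    return float(np.dot(a_val, y[a_idx] + u * z_support))

rng = np.random.default_rng(0)
y = np.arange(10, dtype=float)
a_idx = np.array([2, 7])           # support of a (hypothetical sparse feature)
a_val = np.array([1.0, -3.0])      # non-zero values of a
exact = perturbed_inner_product(a_idx, a_val, y, 0.0, rng)   # no perturbation
noisy = perturbed_inner_product(a_idx, a_val, y, 0.5, rng)   # u = 0.5
```

The same shortcut is not available for the uniform distribution on the $\ell_2$-ball, whose coordinates are dependent through the normalization.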

There are several objectives $f + \varphi$ with domains $\mathcal{X}$ for which the natural
geometry is non-Euclidean, which motivates the mirror descent family of algorithms [25]. By using
different distributions $\mu$ for the random perturbations $Z_{i,t}$ in (6), we can take advantage
of non-Euclidean geometry. Here we give an example that is quite useful for problems in which the
optimizer $x^*$ is sparse; for example, the optimization set $\mathcal{X}$ may be a simplex or
$\ell_1$-ball, or $\varphi(x) = \lambda\|x\|_1$. The idea in this corollary is that we achieve a
pair of dual norms that may give better optimization performance than the $\ell_2$-$\ell_2$ pair
above.

Corollary 3. Let $\mu$ be the uniform density on $B_\infty(0, 1)$ and assume that $F(\cdot;\xi)$
is $L_0$-Lipschitz continuous with respect to the $\ell_1$-norm over
$\mathcal{X} + B_\infty(0, u)$ for $\xi \in \Xi$, where we set $u = R\sqrt{d\log d}$. Use the
proximal function $\psi(x) = \frac{1}{2(p-1)}\|x\|_p^2$ for $p = 1 + 1/\log d$ and set
$\eta_t = L_0\sqrt{t+1}/(R\sqrt{m\log d})$ and $L_t = L_0/u_t$. There is a universal constant $C$
such that

\[ \mathbb{E}[f(x_T) + \varphi(x_T)] - [f(x^*) + \varphi(x^*)] \le C\,\frac{L_0\|x^*\|_1\sqrt{d\log d}}{T} + C\,\frac{L_0\|x^*\|_1\log d}{\sqrt{Tm}}. \]




The dimension dependence of $\sqrt{d\log d}$ on the leading $1/T$ term in the corollary is weaker
than the $d^{1/4}$ dependence in the earlier corollaries, so for very large $m$ the corollary is
not as strong as one desires when taking advantage of non-Euclidean geometry. Nonetheless, for
large $T$, the $1/\sqrt{Tm}$ terms dominate the convergence rates, and Corollary 3 can be an
improvement.

Our final corollary specializes the high probability convergence result in Theorem 2 by showing 
that the error is sub-Gaussian (9) under the assumptions in the corollary. We state the corollary 
for problems with Euclidean geometry, but it is clear that similar results hold for non-Euclidean 
geometry as above. 

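Corollary 4. Assume that $F(\cdot;\xi)$ is $L_0$-Lipschitz with respect to the $\ell_2$-norm. Let
$\psi(x) = \frac{1}{2}\|x\|_2^2$ and assume that $\mathcal{X}$ is compact with
$\|x - x^*\|_2 \le R$ for $x, x^* \in \mathcal{X}$. Using smoothing distribution $\mu$ uniform on
$B_2(0, 1)$, smoothing parameter $u = Rd^{1/4}$, damping parameter
$\eta_t = L_0\sqrt{t+1}/(R\sqrt{m})$, and instantaneous Lipschitz estimate
$L_t = L_0\sqrt{d}/u_t$, with probability at least $1 - \delta$,

\[ f(x_T) + \varphi(x_T) - f(x^*) - \varphi(x^*) = O\Big( \frac{L_0 R d^{1/4}}{T} + \frac{L_0 R}{\sqrt{Tm}} + \frac{L_0 R\sqrt{\log\frac{1}{\delta}}}{\sqrt{Tm}} + \frac{L_0 R\max\{\log\frac{1}{\delta}, \log T\}}{T\sqrt{m}} \Big). \]
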



Remarks: We make two remarks about the above corollaries. The first is that if one abandons
the requirement that the optimization procedure be an "anytime" algorithm (always able to return
a result), it is possible to give similar results by using a fixed setting of $u_t = u$
throughout. In particular, using Theorem 3 in Section 4.4 we can use $u_t = u/T$ to get essentially
the same results as Corollaries 1-3. As a side benefit, it is then easier to satisfy the Lipschitz
condition that $\mathbb{E}[\|\partial F(x;\xi)\|^2] \le L_0^2$ for
$x \in \mathcal{X} + \operatorname{supp}\mu$. Our second observation is that Theorem 1 and the
corollaries appear to require a very specific setting of the constant $L_t$ to achieve fast rates.
However, the algorithm is in fact robust to mis-specification of $L_t$, since the instantaneous
smoothness constant $L_t$ is dominated by the stochastic damping term $\eta_t$ in the algorithm.
Indeed, since $\eta_t$ grows proportionally to $\sqrt{t}$ for each corollary, we always have
$L_t = L_1/u_t = L_1/(\theta_t u) = O(\eta_t/(\sqrt{t}\,\theta_t))$; that is, $L_t$ is order
$\sqrt{t}$ smaller than $\eta_t/\theta_t$, so setting $L_t$ incorrectly up to order $\sqrt{t}$ has
essentially negligible effect.

We can show the bounds in the theorems above are tight, that is, unimprovable up to constant
factors, by exploiting known lower bounds [25, 1] for stochastic optimization problems. We re-state
some of these results here. For instance, let
$\mathcal{X} = \{x \in \mathbb{R}^d \mid \|x\|_2 \le R_2\}$, and consider all convex functions $f$
that are $L_{0,2}$-Lipschitz with respect to the $\ell_2$-norm. Assume that the stochastic oracle,
when queried at a point $x$, returns a vector $g$ whose expectation is in $\partial f(x)$ with
$\mathbb{E}[\|g\|_2^2] \le L_{0,2}^2$. Then for any method that outputs a point
$x_T \in \mathcal{X}$ after $T$ queries of the oracle, we have the lower bound

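\[ \sup_{f} \Big\{ \mathbb{E}[f(x_T)] - \min_{x \in \mathcal{X}} f(x) \Big\} = \Omega\Big( \frac{L_{0,2} R_2}{\sqrt{T}} \Big), \]

where the supremum is taken over convex $f$ that are $L_{0,2}$-Lipschitz with respect to the
$\ell_2$-norm [1, Section 3.1]. Similar bounds hold for problems with non-Euclidean geometry [1];
in particular, consider convex $f$ that are $L_{0,\infty}$-Lipschitz with respect to the
$\ell_1$-norm, that is, $|f(x) - f(y)| \le L_{0,\infty}\|x - y\|_1$. Then setting
$\mathcal{X} = \{x \in \mathbb{R}^d \mid \|x\|_1 \le R_1\}$, we have
$B_\infty(0, R_1/d) \subset B_1(0, R_1)$ and thus

\[ \sup_{f} \Big\{ \mathbb{E}[f(x_T)] - \min_{x \in \mathcal{X}} f(x) \Big\} = \Omega\Big( \frac{L_{0,\infty} R_1}{\sqrt{T}} \Big). \]

In either geometry, no method can have optimization error smaller than $O(LR/\sqrt{T})$ after $T$
queries of the stochastic oracle.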

Let us compare the above lower bounds to the convergence rates in Corollaries 1 through 3.
Examining the bound in Corollaries 1 and 2, we see that the dominant terms are order
$L_0 R/\sqrt{Tm}$ so long as $m \le T/\sqrt{d}$. Since our method issues $Tm$ queries to the
oracle, this is optimal. Similarly, the strategy of sampling uniformly from the
$\ell_\infty$-ball in Corollary 3 is optimal up to factors logarithmic in the dimension. In
contrast to other optimization procedures, however, our algorithm performs an update to the
parameter $x_t$ only after every $m$ queries to the oracle; as we show in the next section, this is
beneficial in several applications.
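
The threshold $m \le T/\sqrt{d}$ follows from comparing the two terms of the Euclidean bounds. Ignoring numerical constants, the $1/\sqrt{Tm}$ term dominates the $d^{1/4}/T$ term exactly when

```latex
\frac{L_0 R}{\sqrt{Tm}} \;\ge\; \frac{L_0 R\, d^{1/4}}{T}
\quad\Longleftrightarrow\quad
\sqrt{\frac{T}{m}} \;\ge\; d^{1/4}
\quad\Longleftrightarrow\quad
m \;\le\; \frac{T}{\sqrt{d}} .
```

Writing $N = Tm$ for the total number of oracle queries, the dominant term is $L_0 R/\sqrt{N}$, which matches the lower bound $\Omega(LR/\sqrt{N})$ for any method that makes $N$ queries.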



3 Applications and experimental results 

In this section, we describe some applications of our results, and then give experimental results 
that illustrate our theoretical predictions. 



3.1 Some applications 

The first application of our results is to parallel computation and distributed optimization.
Imagine that instead of querying the stochastic oracle serially, we can issue queries and aggregate
the resulting stochastic gradients in parallel. In particular, assume that we have a procedure in
which the $m$ queries of the stochastic oracle occur concurrently. Then Corollaries 1-4 imply that
in the same amount of time required to perform $T$ queries and updates of the stochastic gradient
oracle serially, achieving an optimization error of $O(1/\sqrt{T})$, the parallel implementation
can process $Tm$ queries and consequently has optimization error $O(1/\sqrt{Tm})$.

We now briefly describe two possibilities for a distributed implementation of the above. The
simplest architecture is a master-worker architecture, in which one master maintains the
parameters $(x_t, y_t, z_t)$, and each of $m$ workers has access to an uncorrelated stochastic
oracle for $P$ and the smoothing distribution $\mu$. The master broadcasts the point $y_t$ to the
workers, which sample $\xi_i \sim P$ and $Z_i \sim \mu$, returning sample gradients to the master.
In a tree-structured network, broadcast and aggregation require at most $O(\log m)$ steps; the
relative speedup over a serial implementation is $O(m/\log m)$. In recent work, Dekel et al. [9]
give a series of reductions showing how to distribute variance-based stochastic algorithms and
achieve an asymptotically optimal convergence rate. The algorithm given here, as specified by
equations (6) and (7a)-(7c), can be exploited within their framework to achieve an $O(m)$
improvement in convergence rate over a serial implementation. More precisely, whereas achieving
optimization error $\epsilon$ requires $O(1/\epsilon^2)$ iterations for a centralized algorithm,
the distributed adaptation requires only $O(1/(m\epsilon^2))$ iterations. Such an improvement
is possible as a consequence of the variance reduction techniques we have described.
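
The $O(\log m)$ aggregation cost in a tree-structured network can be illustrated with a pairwise reduction. The sketch below simulates the rounds sequentially on one machine; it is an illustration of the communication pattern only, with a hypothetical function name, and is not a distributed implementation.

```python
import numpy as np

def tree_aggregate(vectors):
    """Sum worker gradients by pairwise (binary-tree) reduction.
    Returns the total and the number of reduction rounds used."""
    level = list(vectors)
    rounds = 0
    while len(level) > 1:
        # Each round, neighbouring pairs are combined; an odd element passes through.
        level = [level[i] + level[i + 1] if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

worker_grads = [np.full(3, float(i)) for i in range(8)]   # m = 8 hypothetical workers
total, n_rounds = tree_aggregate(worker_grads)
average_grad = total / len(worker_grads)                  # the averaged gradient g_t
```

With $m = 8$ workers the reduction finishes in 3 rounds, compared with 7 sequential additions at a single master.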

A second application of interest involves problems where the set $\mathcal{X}$ and/or the function
$\varphi$ are complicated, so that calculating the proximal update (7b) becomes expensive. The
proximal update may be distilled to computing

\[ \min_{x \in \mathcal{X}} \{ \langle g, x \rangle + \psi(x) \} \quad \text{or} \quad \min_{x \in \mathcal{X}} \{ \langle g, x \rangle + \psi(x) + \varphi(x) \}. \tag{10} \]

In such cases, it may be beneficial to accumulate gradients by querying the stochastic oracle
several times in each iteration, using the averaged subgradient in the update (7b), and thus solve
only one proximal sub-problem for a collection of samples.

Let us consider some concrete examples. In statistical applications involving the estimation of
covariance matrices, the domain $\mathcal{X}$ is constrained in the positive semidefinite cone
$\{X \in \mathbb{S}_n \mid X \succeq 0\}$; other applications involve additional nuclear-norm
constraints of the form
$\mathcal{X} = \{X \in \mathbb{R}^{d_1 \times d_2} \mid \sum_{j=1}^{\min\{d_1, d_2\}} \rho_j(X) \le C\}$.
Examples of such problems include covariance matrix estimation, matrix completion, and model
identification in vector autoregressive processes (see the paper [24] and references therein for
further discussion). Another example is the problem of metric learning [37, 33], in which the
learner is given a set of $n$ points $\{a_1, \ldots, a_n\} \subset \mathbb{R}^d$ and a matrix
$B \in \mathbb{R}^{n \times n}$ indicating which points are close together in an unknown metric.
The goal is to estimate a positive semidefinite matrix $X \succeq 0$ such that
$\langle (a_i - a_j), X(a_i - a_j) \rangle$ is small when $a_i$ and $a_j$ belong to the same class
or are close, while $\langle (a_i - a_j), X(a_i - a_j) \rangle$ is large when $a_i$ and $a_j$
belong to different classes. It is desirable that the matrix $X$ have low rank, which allows the
statistician to discover structure or guarantee performance on unseen data. As a concrete
illustration, suppose that we are given a matrix $B \in \{-1, 1\}^{n \times n}$, where
$b_{ij} = 1$ if $a_i$ and $a_j$ belong to the same class, and $b_{ij} = -1$ otherwise. In this
case, one possible optimization-based estimator involves solving the non-smooth program

\[ \min_{X, x}\ \frac{1}{\binom{n}{2}} \sum_{i < j} \big[ 1 + b_{ij}\big( \operatorname{tr}(X(a_i - a_j)(a_i - a_j)^\top) + x \big) \big]_+ \quad \text{s.t. } X \succeq 0,\ \operatorname{tr}(X) \le C. \tag{11} \]
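
A stochastic oracle for this kind of objective samples a pair $(i, j)$ and returns a subgradient of the corresponding hinge term. The sketch below assumes the hinge form $[1 + b_{ij}(\operatorname{tr}(X vv^\top) + x)]_+$ with $v = a_i - a_j$; the function name and argument layout are hypothetical.

```python
import numpy as np

def pair_subgradient(X, offset, a_i, a_j, b_ij):
    """Subgradient of [1 + b_ij * (v^T X v + offset)]_+ with v = a_i - a_j,
    taken with respect to the matrix X and the scalar offset."""
    v = a_i - a_j
    margin = 1.0 + b_ij * (v @ X @ v + offset)   # costs O(d^2) for a d x d matrix
    if margin > 0:
        return b_ij * np.outer(v, v), float(b_ij)
    return np.zeros_like(X), 0.0                 # hinge inactive: zero subgradient

X0 = np.eye(2)
a1, a2 = np.array([1.0, 0.0]), np.array([0.0, 0.0])
G_active, gx_active = pair_subgradient(X0, 0.0, a1, a2, 1)       # margin = 2 > 0
G_inactive, gx_inactive = pair_subgradient(X0, 0.5, a1, a2, -1)  # margin = -0.5
```

Each call involves only matrix-vector products and an outer product, which is the $O(d^2)$ per-sample cost that the averaging step (6) amortizes against the more expensive projection.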



Now let us consider the cost of computing the projection update (10) for the metric learning
problem (11). When $\psi(X) = \frac{1}{2}\|X\|_{\mathrm{Fr}}^2$, the update (10) reduces for appropriate choice of V to

$$\min_{X} \; \frac{1}{2}\|X - V\|_{\mathrm{Fr}}^2 \quad \text{subject to} \quad X \succeq 0, \; \mathrm{tr}(X) \le C.$$

(As a side note, it is possible to generalize this update to Schatten p-norms [10].) The above
problem is equivalent to projecting the eigenvalues of V to the simplex $\{x \in \mathbb{R}^d \mid x \succeq 0, \langle \mathbb{1}, x\rangle \le C\}$,
which after an $O(d^3)$ eigen-decomposition requires time $O(d)$ [6]. To see the benefits of the
randomized perturbation and averaging technique (6) over standard stochastic gradient descent (4),
consider that the cost of querying a stochastic oracle for the objective (11) for one sample pair
requires time $O(d^2)$. Thus, m queries require $O(md^2)$ computation, and each update requires
$O(d^3)$. So we see that after $Tmd^2 + Td^3$ units of computation, our randomized perturbation
method has optimization error $O(1/\sqrt{Tm})$, while standard stochastic gradient requires $Tmd^3$ units
of computation to reach the same error. In short, for $m \approx d$ the randomized smoothing technique (6) uses a factor $O(d)$
less computation than standard stochastic gradient; we give experiments corroborating this in
Section 3.2.2.
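The eigenvalue projection step described above can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the function name `project_capped_simplex` is hypothetical, the eigen-decomposition of V is assumed to have been computed elsewhere, and the sort-based routine runs in $O(d \log d)$ time rather than the $O(d)$ time of the method cited as [6].

```python
def project_capped_simplex(v, C):
    """Euclidean projection of v onto {x : x >= 0, sum(x) <= C}, assuming C > 0.

    Hypothetical helper: v plays the role of the eigenvalues of V.
    """
    clipped = [max(vi, 0.0) for vi in v]
    if sum(clipped) <= C:
        # The trace constraint is inactive, so clipping at zero suffices.
        return clipped
    # Otherwise the constraint sum(x) = C is active: find the shift tau
    # such that sum(max(v_i - tau, 0)) = C, using a sort-based search.
    u = sorted(v, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for k, uk in enumerate(u, start=1):
        cumsum += uk
        t = (cumsum - C) / k
        if uk - t > 0:
            tau = t  # the largest k passing this test gives the right shift
    return [max(vi - tau, 0.0) for vi in v]
```

Given an eigen-decomposition $V = Q\,\mathrm{diag}(\lambda)\,Q^\top$, the projected matrix would then be $Q\,\mathrm{diag}(\lambda')\,Q^\top$ with $\lambda'$ the output of this routine.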

3.2 Experimental results 

We now describe some experimental results that confirm the sharpness of our theoretical predictions. 
The first experiment explores the benefit of using multiple samples m when estimating the gradient
$\nabla f(y_t)$ as in the averaging step (6). The second experiment studies the actual amount of time
required to solve a statistical metric learning problem, as described in the objective (11) above. 
The third investigates whether the smoothing technique is essential to algorithms solving non- 
smooth stochastic problems — that is, whether the smoothing is only a proof device or whether it 
is necessary to achieve good performance. 

3.2.1 Iteration Complexity of Reduced Variance Estimators 

In this experiment, we consider the number of iterations of the accelerated method (7a)-(7c) required
to achieve an $\epsilon$-optimal solution to the problem (2). To understand how the iteration count scales
with the number m of gradient samples, we consider the number of iterations $T(\epsilon, m)$ required to
achieve optimization error $\epsilon$ when using m gradient samples in the averaging step (6). Specifically, we define

$$T(\epsilon, m) = \inf\left\{t \in \mathbb{N} \;\middle|\; f(x_t) - \min_{x^* \in \mathcal{X}} f(x^*) \le \epsilon\right\}.$$

We focus on the algorithm analyzed in Corollary 1, which uses uniform sampling of the $\ell_2$-ball. Corollary 1
implies there should be two regimes of convergence: one when the number m of samples is
small, so that the $L_0R/\sqrt{Tm}$ term is dominant, and the other when m is large, so that the $L_0Rd^{1/4}/T$
term is dominant. By inverting the first term, we see that for small m, $T(\epsilon, m) = O(L_0^2R^2/(m\epsilon^2))$,
while the second gives $T(\epsilon, m) = O(L_0Rd^{1/4}/\epsilon)$. In particular, our theory predicts

$$T(\epsilon, m) = O\left(\max\left\{\frac{L_0^2R^2}{m\epsilon^2}, \frac{L_0Rd^{1/4}}{\epsilon}\right\}\right). \tag{12}$$
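To make the two regimes in the prediction (12) concrete, the following sketch evaluates the two terms with all constant factors dropped and computes the batch size at which they balance. The function names and the default values $L_0 = R = 1$ are assumptions made purely for illustration.

```python
def predicted_iterations(eps, m, d, L0=1.0, R=1.0):
    """Prediction (12) up to constants: max of the variance and smoothing terms."""
    variance_term = (L0 * R) ** 2 / (m * eps ** 2)   # dominant for small m
    smoothing_term = L0 * R * d ** 0.25 / eps        # dominant for large m
    return max(variance_term, smoothing_term)


def crossover_batch_size(eps, d, L0=1.0, R=1.0):
    """Batch size m at which the two terms in (12) are equal (constants dropped)."""
    return L0 * R / (eps * d ** 0.25)
```

Under this constant-free reading, increasing m beyond roughly $L_0R/(\epsilon d^{1/4})$ should give no further reduction in the iteration count, which is the saturation effect the experiment below looks for.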

To assess the accuracy of the prediction (12), we consider a robust linear regression problem,
commonly studied in system identification and robust statistics [29, 14]. Specifically, we have a
matrix $A \in \mathbb{R}^{n \times d}$ and vector $b \in \mathbb{R}^n$, and seek to minimize

$$f(x) = \frac{1}{n}\|Ax - b\|_1 = \frac{1}{n}\sum_{i=1}^n |\langle a_i, x\rangle - b_i|, \tag{13}$$

where $a_i \in \mathbb{R}^d$ denotes a transposed row of A. It is clear that the problem (13) is non-smooth. The
stochastic oracle in this experiment, when queried at a point x, chooses an index $i \in [n]$ uniformly
at random and returns $\mathrm{sign}(\langle a_i, x\rangle - b_i)\,a_i$.
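A minimal sketch of the stochastic oracle just described, written with plain Python lists, is given below. The function names are hypothetical, and the convention sign(0) = 0 is an assumption (it still yields a valid subgradient, since 0 lies in the subdifferential of the absolute value at zero).

```python
import random


def sign(v):
    """Return -1, 0, or +1 according to the sign of v."""
    return (v > 0) - (v < 0)


def l1_regression_oracle(A, b, x, rng):
    """Sample a row index uniformly and return sign(<a_i, x> - b_i) * a_i."""
    i = rng.randrange(len(A))
    residual = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
    s = sign(residual)
    return [s * a for a in A[i]]
```

For example, `l1_regression_oracle(A, b, x, random.Random(0))` returns one stochastic subgradient of the objective (13) at `x`.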

To perform our experiments, we generate n = 1000 points in dimensions $d \in \{50 \cdot 2^i\}_{i=0}^{5}$, each
with fixed norm $\|a_i\|_2 = L_0$, and then assign values $b_i$ by computing $\langle a_i, w\rangle$ for a random vector
w (adding normally distributed noise with variance 0.1). We estimate the quantity $T(\epsilon, m)$ for
solving the robust regression problem (13) for several values of m and d. Figure 1 shows results
for dimensions $d \in \{50, 100, 200, 400\}$, averaged over 20 experiments for each choice of dimension
d. (Other settings of d exhibited similar behavior.) Each panel in the figure shows, on a log-log
scale, the experimental average $T(\epsilon, m)$ and the theoretical prediction (12). The decrease in
$T(\epsilon, m)$ is nearly linear for smaller numbers of samples m; for larger m, the effect is quite diminished.
We present numerical results in Table 1 that allow us to roughly estimate the number m at which
increasing the batch size in the gradient estimate (6) gives no further improvement. For each
dimension d, Table 1 indeed shows that from m = 1 to 5, the iteration count $T(\epsilon, m)$ decreases
linearly, halving again when we reach $m \approx 20$, but between m = 100 and 1000 there is at most
an 11% difference in $T(\epsilon, m)$, while between m = 1000 and m = 10000 the decrease amounts to at
most 3%. The good qualitative match between the iteration complexity predicted by our theory
and the actual performance of the methods is clear.

Figure 1. The number of iterations $T(\epsilon, m)$ to achieve an $\epsilon$-optimal solution for the problem (13)
as a function of the number of samples m used in the gradient estimate (6). The prediction (12) is
the square black line in each plot, and each plot shows results for different dimensions d: (a) d = 50,
(b) d = 100, (c) d = 200, (d) d = 400.

d           m =     1        2        3        5       20      100     1000    10000
d = 50    Mean   612.2    252.7    195.9    116.7     66.1     52.2     47.7     46.6
          Std    158.29    34.67    38.87    13.63     3.18     1.66     1.42     1.28
d = 100   Mean   762.5    388.3    272.4    193.6    108.6     83.3     75.3     73.3
          Std     56.70    19.50    17.59    10.65     1.91     1.27     0.78     0.78
d = 200   Mean  1002.7    537.8    371.1    267.2    146.8    109.8     97.9     95.0
          Std     68.64    26.94    13.75    12.70     1.66     1.25     0.54     0.45
d = 400   Mean  1261.9    656.2    463.2    326.1    178.8    133.6    118.6    115.0
          Std     60.17    38.59    12.97     8.36     2.04     1.02     0.49     0.00
d = 800   Mean  1477.1    783.9    557.9    388.3    215.3    160.6    142.0    137.4
          Std     44.29    24.87    12.30     9.49     2.90     0.66     0.00     0.49
d = 1600  Mean  1609.5    862.5    632.0    448.9    251.5    191.1    169.4    164.0
          Std     42.83    30.55    12.73     8.17     2.73     0.30     0.49     0.00

Table 1. The number of iterations $T(\epsilon, m)$ to achieve $\epsilon$-accuracy for the regression problem (13) as
a function of number of gradient samples m used in the gradient estimate (6) and the dimension d.
Each box in the table shows the mean and standard deviation of $T(\epsilon, m)$ measured over 20 trials.



3.2.2 Metric Learning 

Our second set of experiments applies to instances of metric learning. The data we receive consists
of a set of vectors $a_i \in \mathbb{R}^d$ and measures $b_{ij} \ge 0$ of the similarity between the vectors $a_i$ and $a_j$ (here
$b_{ij} = 0$ means that $a_i$ and $a_j$ are the same). The statistical goal is to learn a matrix X, inducing a
pseudo-norm via $\|a\|_X^2 := \langle a, Xa\rangle$, such that $\langle (a_i - a_j), X(a_i - a_j)\rangle \approx b_{ij}$. Consequently, we solve
the regression-like problem

$$f(X) = \frac{1}{\binom{n}{2}} \sum_{i<j} \left|\mathrm{tr}\left(X(a_i - a_j)(a_i - a_j)^\top\right) - b_{ij}\right| \quad \text{subject to} \quad \mathrm{tr}(X) \le C, \; X \succeq 0.$$

The stochastic oracle for this problem is simple: given a query matrix X, the oracle chooses a pair
$(i, j)$ uniformly at random, then returns the subgradient

$$\mathrm{sign}\left[\langle (a_i - a_j), X(a_i - a_j)\rangle - b_{ij}\right](a_i - a_j)(a_i - a_j)^\top.$$
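The subgradient returned by this oracle for a single pair can be sketched with nested Python lists as follows. The function name `metric_subgradient` is hypothetical, the pair is assumed to have already been sampled, and no attempt is made at efficiency; each call costs $O(d^2)$ operations, in line with the per-query cost discussed earlier.

```python
def metric_subgradient(X, a_i, a_j, b_ij):
    """Return sign(<diff, X diff> - b_ij) * diff diff^T, with diff = a_i - a_j."""
    d = len(a_i)
    diff = [p - q for p, q in zip(a_i, a_j)]
    # Quadratic form <diff, X diff>, computed via the product X diff.
    X_diff = [sum(X[r][c] * diff[c] for c in range(d)) for r in range(d)]
    quad = sum(diff[r] * X_diff[r] for r in range(d))
    s = (quad > b_ij) - (quad < b_ij)  # sign of the residual, 0 at a tie
    # Rank-one outer product scaled by the sign.
    return [[s * diff[r] * diff[c] for c in range(d)] for r in range(d)]
```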

We solve ten random problems with dimension d = 100 and n = 2000, giving an objective with
$4 \cdot 10^6$ terms and 5050 parameters. We plot experimental results in Fig. 2 showing the optimality
gap $f(X_t) - \inf_{X^* \in \mathcal{X}} f(X^*)$ as a function of computation time. We plot several lines, each of which
captures the performance of the algorithm using a different number m of samples in the smoothing
step (6). It is clear that performing stochastic optimization is more viable for this problem than a
non-stochastic method, as even computing the objective requires $O(n^2d^2)$ operations. As predicted
by our theory and discussion in Sec. 3, it is clear that receiving more samples m gives improvements
in convergence rate as a function of time. Our theory also predicts that for m > d, there should
be no improvement in actual time taken to minimize the objective; the plot in Fig. 2 suggests that
this too is correct, as the plots for m = 64 and m = 128 are essentially indistinguishable.

Figure 2. Optimization error $f(X_t) - \inf_{X^* \in \mathcal{X}} f(X^*)$ in the metric learning problem of Sec. 3.2.2
as a function of time in seconds. Each line indicates optimization error over time for a particular
number of samples m in the gradient estimate (6); we set $m = 2^i$ for $i \in \{1, \ldots, 7\}$.



3.2.3 Necessity of randomized smoothing 

A reasonable question is whether the extra sophistication of the random smoothing (6) is necessary.
Can receiving more samples m from the stochastic oracle, all evaluated at the same point, give
the same benefit to the simple dual averaging method (4)? We do not know the full answer to
this question, though we give an experiment here that suggests that the answer is negative, in that
smoothing does give demonstrable improvement.
For this experiment, we use the objective

$$f(x) = \frac{1}{n}\sum_{i=1}^n \|x - a_i\|_1, \tag{14}$$

where the $a_i \in \{-1, +1\}^d$, and each component j of the vector $a_i$ is sampled independently from
$\{-1, 1\}$ and equal to 1 with probability $1/\sqrt{j}$. Even as $n \uparrow \infty$, the function f remains non-smooth,
since the $a_i$ belong to a discrete set and each value of $a_i$ occurs with positive probability. As in
Sec. 3.2.1, we compute $T(\epsilon, m)$ to be the number of iterations required to achieve an $\epsilon$-optimal
solution to the objective (14). We compare two algorithms that use m queries of the stochastic
gradient oracle, which when queried at a point x chooses an index $i \in [n]$ uniformly at random
and returns $\mathrm{sign}(x - a_i) \in \partial\|x - a_i\|_1$. The first algorithm is the dual averaging algorithm (4),
where $g_t$ is the average of m queries to the stochastic oracle at the current iterate $x_t$. The second
is the accelerated method (7a)-(7c) with the randomized averaging (6). We plot the results in
Fig. 3. Of the several stepsize sequences $\alpha_t$ we tested for the update (4), we plot the best, to make the
comparison as favorable as possible for simple mirror descent. It is clear that while there is moderate
improvement for the non-smooth method when the number of samples m grows, and both methods
are (unsurprisingly) essentially indistinguishable for m = 1, the smoothed sampling strategy has
much better iteration complexity as m grows.

Figure 3. The number of iterations $T(\epsilon, m)$ to achieve an $\epsilon$-optimal solution to (14) for simple
mirror descent and the smoothed gradient method.



4 Proofs 

In this section, we provide the proofs of Theorems 1 and 2, as well as Corollaries 1 through 4. We 
begin with the proofs of the corollaries, after which we give the full proofs of the theorems. In both 
cases, we defer some of the more technical lemmas to appendices. 

The general technique for the proof of each corollary is as follows. First, we recognize that the
randomly smoothed function $f_\mu(x) = \mathbb{E} f(x + Z)$ for $Z \sim \mu$ has Lipschitz continuous gradients and
is uniformly close to the original non-smooth function f. This allows us to apply Theorem 1 or Theorem 3.
The second step is to realize that with the sampling procedure (6), the variance $\mathbb{E}\|e_t\|_*^2$ decreases at
a rate of approximately 1/m, where m is the number of gradient samples. Choosing the stepsizes appropriately
in the theorems then completes the proofs. Proofs of these corollaries require relatively tight control
of the smoothness properties of the smoothing convolution (3), so we refer frequently to several
lemmas stated in Appendix E.
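As a rough illustration of the sampling procedure (6) referred to above, the following sketch averages m subgradients evaluated at randomly perturbed points $y + uZ_i$, with $Z_i$ drawn uniformly from the $\ell_2$ unit ball. The helper names and the $\ell_1$-norm test oracle are hypothetical choices made for this example only; the oracle signature (a point and a random generator) is likewise an assumption.

```python
import math
import random


def sample_unit_ball(d, rng):
    """Draw a point uniformly from the l2 unit ball in R^d."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    radius = rng.random() ** (1.0 / d)  # radius with density proportional to r^(d-1)
    return [radius * v / norm for v in g]


def smoothed_gradient(subgrad_oracle, y, u, m, rng):
    """Average m oracle subgradients evaluated at the perturbed points y + u * Z_i."""
    d = len(y)
    total = [0.0] * d
    for _ in range(m):
        z = sample_unit_ball(d, rng)
        point = [yi + u * zi for yi, zi in zip(y, z)]
        g = subgrad_oracle(point, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / m for t in total]


def l1_subgradient(point, rng):
    """Example oracle (deterministic here): a subgradient of the l1 norm."""
    return [float((p > 0) - (p < 0)) for p in point]
```

Because the m perturbations are independent, the averaged estimate has variance shrinking roughly like 1/m, which is the effect exploited in the proofs below.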



4.1 Proof of Corollaries 1 and 2 

We begin by proving Corollary 1. Recall the averaged quantity $g_t = \frac{1}{m}\sum_{i=1}^m g_{i,t}$, and that
$g_{i,t} \in \partial F(y_t + u_tZ_i; \xi_i)$, where the random variables $Z_i$ are distributed uniformly on the ball $B_2(0, 1)$.
From Lemma 8 in Appendix E, the variance of $g_t$ as an estimate of $\nabla f_{\mu_t}(y_t)$ satisfies

$$\sigma^2 := \mathbb{E}[\|e_t\|_2^2] = \mathbb{E}\left[\|\nabla f_{\mu_t}(y_t) - g_t\|_2^2\right] \le \frac{L_0^2}{m}. \tag{15}$$

Further, for Z distributed uniformly on $B_2(0, 1)$, we have the bound

$$f(x) \le \mathbb{E}[f(x + uZ)] \le f(x) + L_0u,$$

and moreover, the function $x \mapsto \mathbb{E}[f(x + uZ)]$ has $L_0\sqrt{d}/u$-Lipschitz continuous gradient. Thus,
applying Lemma 8 and Theorem 1 with the setting $L_t = L_0\sqrt{d}/(u\theta_t)$, we obtain

$$\mathbb{E}[f(x_T) + \varphi(x_T)] - [f(x^*) + \varphi(x^*)] \le \frac{6L_0R^2\sqrt{d}}{Tu} + \frac{2\eta_TR^2}{T} + \frac{1}{T}\sum_{t=0}^{T-1}\frac{1}{\eta_t}\cdot\frac{L_0^2}{m} + \frac{4L_0u}{T},$$

where we have used the bound (15).

Recall that $\eta_t = L_0\sqrt{t + 1}/(R\sqrt{m})$ by construction. Coupled with the inequality

$$\sum_{t=1}^{T}\frac{1}{\sqrt{t}} \le 1 + \int_1^T \frac{1}{\sqrt{t}}\,dt = 1 + 2(\sqrt{T} - 1) \le 2\sqrt{T}, \tag{16}$$

we use that $2\sqrt{T + 1}/T + 2/\sqrt{T} \le 5/\sqrt{T}$ to obtain

$$\mathbb{E}[f(x_T) + \varphi(x_T)] - [f(x^*) + \varphi(x^*)] \le \frac{6L_0R^2\sqrt{d}}{Tu} + \frac{5L_0R}{\sqrt{Tm}} + \frac{4L_0u}{T}.$$

Substituting the specified setting of $u = Rd^{1/4}$ completes the proof.

The proof of Corollary 2 is essentially identical, differing only in the setting of $u = Rd^{-1/4}$ and
the application of Lemma 9 instead of Lemma 8 in Appendix E.
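The two elementary inequalities used in the proof of Corollary 1, namely the integral bound on $\sum_{t=1}^T 1/\sqrt{t}$ and the constant-collecting step $2\sqrt{T+1}/T + 2/\sqrt{T} \le 5/\sqrt{T}$, can be checked numerically. The short script below is only a sanity check over a finite range of T, not a proof; the function name is hypothetical.

```python
import math


def corollary1_inequalities_hold(T):
    """Check both elementary inequalities from the proof of Corollary 1 at a given T >= 1."""
    partial_sum = sum(1.0 / math.sqrt(t) for t in range(1, T + 1))
    sum_bound_ok = partial_sum <= 2.0 * math.sqrt(T)
    constant_ok = (2.0 * math.sqrt(T + 1) / T + 2.0 / math.sqrt(T)
                   <= 5.0 / math.sqrt(T))
    return sum_bound_ok and constant_ok
```

The tightest case for the second inequality is T = 1, where the left-hand side equals $2\sqrt{2} + 2 \approx 4.83$.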



4.2 Proof of Corollary 3 

Under the stated conditions of the corollary, Lemma 6 implies that when $\mu$ is uniform on $B_\infty(0, u)$,
then the function $f_\mu(x) := \mathbb{E}_\mu[f(x + Z)]$ has $L_0/u$-Lipschitz continuous gradient with respect to
the $\ell_1$-norm, and moreover it satisfies the upper bound $f_\mu(x) \le f(x) + \frac{L_0du}{2}$. Fix $x \in \mathcal{X}$ and let
$g_i \in \partial F(x + Z_i; \xi_i)$, with $g = \frac{1}{m}\sum_{i=1}^m g_i$. We claim that for any u the error satisfies

$$\mathbb{E}\left[\|g - \nabla f_\mu(x)\|_\infty^2\right] \le C\,\frac{L_0^2\log d}{m} \tag{17}$$

for some universal constant C. Indeed, Lemma 6 shows that $\mathbb{E}[g] = \nabla f_\mu(x)$; moreover, component j
of the random vector $g_i$ is an unbiased estimator of the jth component of $\nabla f_\mu(x)$. Since $\|g_i\|_\infty \le L_0$
and $\|\nabla f_\mu(x)\|_\infty \le L_0$, the vector $g_i - \nabla f_\mu(x)$ is a d-dimensional random vector whose components
are sub-Gaussian with sub-Gaussian parameter $4L_0^2$. Conditional on x, the $g_i$ are independent, so
$g - \nabla f_\mu(x)$ has sub-Gaussian components with parameter at most $4L_0^2/m$. Applying Lemma 14
(see Appendix F) with $X = g - \nabla f_\mu(x)$ and $\sigma^2 = 4L_0^2/m$ yields the claim (17).

Now, as in the proof of Corollary 1, we can apply Theorem 1. Recall that $\frac{1}{2(p-1)}\|x\|_p^2$ is
strongly convex over $\mathbb{R}^d$ with respect to the $\ell_p$-norm for any $p \in (1, 2]$ [25]. Thus, with the choice
$\psi(x) = \frac{1}{2(p-1)}\|x\|_p^2$ for $p = 1 + 1/\log d$, it is clear that the squared radius $R^2$ of the set $\mathcal{X}$ is of order
$\|x^*\|_p^2\log d \le \|x^*\|_1^2\log d$. All that remains is to relate the Lipschitz constant $L_0$ with respect to
the $\ell_1$ norm to that for the $\ell_p$ norm. Let q be conjugate to p, that is, $1/q + 1/p = 1$. Under the
assumptions of the theorem, we have $q = 1 + \log d$. For any $g \in \mathbb{R}^d$, we have $\|g\|_q \le d^{1/q}\|g\|_\infty$. Of
course, $d^{1/(\log d + 1)} \le d^{1/\log d} = \exp(1)$, so $\|g\|_q \le e\,\|g\|_\infty$.

Having shown that the Lipschitz constant L for the $\ell_p$ norm satisfies $L \le L_0e$, where $L_0$ is the
Lipschitz constant with respect to the $\ell_1$ norm, we apply Theorem 1 and the variance bound (17)
to obtain the result. Specifically, Theorem 1 implies

$$\mathbb{E}[f(x_T) + \varphi(x_T)] - [f(x^*) + \varphi(x^*)] \le \frac{6L_0R^2}{Tu} + \frac{2\eta_TR^2}{T} + \frac{C}{T}\sum_{t=0}^{T-1}\frac{1}{\eta_t}\cdot\frac{L_0^2\log d}{m} + \frac{4L_0du}{2T}.$$

Plugging in u, $\eta_t$, and $R \le \|x^*\|_1\sqrt{\log d}$ and using bound (16) completes the proof.
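The norm comparison used in the proof of Corollary 3, namely that $\|g\|_q \le e\,\|g\|_\infty$ when $q = 1 + \log d$, is easy to check numerically. The sketch below does so for the all-ones vector, which attains $\|g\|_q = d^{1/q}\|g\|_\infty$; the function names are hypothetical.

```python
import math


def lq_norm(g, q):
    """The l_q norm of a vector g for finite q >= 1."""
    return sum(abs(v) ** q for v in g) ** (1.0 / q)


def dual_norm_ratio(d):
    """Ratio ||g||_q / ||g||_inf for the all-ones vector in R^d with q = 1 + log d."""
    q = 1.0 + math.log(d)
    g = [1.0] * d
    return lq_norm(g, q) / max(abs(v) for v in g)
```

For every dimension the ratio stays below $e \approx 2.718$, so moving from the $\ell_\infty$ bound on subgradients to the $\ell_q$ bound costs only a constant factor.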
4.3 Proof of Corollary 4 

The proof of this corollary requires an auxiliary result showing that Assumption B holds under the 
stated conditions. The following result does not appear to be well-known, so we provide a proof in 
Appendix A. In stating it, we recall the definition of the sigma field J^t from Assumption B. 

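Lemma 1. In the notation of Theorem 2, suppose that $F(\cdot; \xi)$ is $L_0$-Lipschitz continuous with
respect to the norm $\|\cdot\|$ over $\mathcal{X} + u_0\,\mathrm{supp}\,\mu$ for P-a.e. $\xi$. Then

$$\mathbb{E}\left[\exp\left(\frac{\|e_t\|_*^2}{\sigma_t^2}\right) \,\Big|\, \mathcal{F}_{t-1}\right] \le \exp(1), \quad \text{where} \quad \sigma_t^2 := 2\max\left\{\mathbb{E}\left[\|e_t\|_*^2 \mid \mathcal{F}_{t-1}\right], \frac{16L_0^2}{m}\right\}.$$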

Using this lemma, we now prove Corollary 4. When $\mu$ is the uniform distribution on $B_2(0, u)$,
Lemma 8 from Appendix E implies that $\nabla f_\mu$ is Lipschitz with constant $L_1 = L_0\sqrt{d}/u$. Lemma 1
ensures that the error $e_t$ satisfies Assumption B. Noting the inequality

$$\max\left\{\log(1/\delta), \sqrt{(1 + \log T)\log(1/\delta)}\right\} \le \max\{\log(1/\delta), 1 + \log T\}$$

and combining the bound in Theorem 2 with Lemma 1, we see that with probability at least $1 - 2\delta$

$$f(x_T) + \varphi(x_T) - f(x^*) - \varphi(x^*) \le \frac{6L_0R^2\sqrt{d}}{Tu} + \frac{4L_0u}{T} + \frac{4R^2\eta}{\sqrt{T + 1}} + \frac{2L_0^2}{m\sqrt{T}\,\eta} + C\,\frac{L_0^2\max\{\log\frac{1}{\delta}, \log T\}}{(T + 1)m\eta} + C\,\frac{L_0R\sqrt{\log\frac{1}{\delta}}}{\sqrt{Tm}}$$

for a universal constant C. Setting $\eta = L_0/(R\sqrt{m})$ and $u = Rd^{1/4}$ gives the result.



4.4 Proof of Theorem 1 

This proof is more involved than that of the above corollaries. In particular, we build on techniques
used in the work of Tseng [34], Lan [19], and Xiao [36]. The changing smoothness of the stochastic
objective, which comes from changing the shape parameter of the sampling distribution Z in
the averaging step (6), adds some challenge. Essentially, the idea of the proof is to let $\mu_t$ be the
density of $u_tZ$ and define $f_{\mu_t}(x) := \mathbb{E}[f(x + u_tZ)]$, where $u_t$ is the non-increasing sequence of shape
parameters in the averaging scheme (6). We show via Jensen's inequality that $f(x) \le f_{\mu_t}(x) \le
f_{\mu_{t-1}}(x)$ for all t, which is intuitive because the variance of the sampling scheme is decreasing.
Then we apply a suitable modification of the accelerated gradient method [34] to the sequence of
functions $f_{\mu_t}$ decreasing to f, and by allowing $u_t$ to decrease appropriately we achieve our result.
At the end of this section, we state a third result (Theorem 3), which gives an alternative setting
for u given a priori knowledge of the number of iterations.
We begin by stating two technical lemmas:






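Lemma 2. Let $f_{\mu_t}$ be a sequence of functions such that $f_{\mu_t}$ has $L_t$-Lipschitz continuous gradients
with respect to the norm $\|\cdot\|$ and assume that $f_{\mu_t}(x) \le f_{\mu_{t-1}}(x)$ for any $x \in \mathcal{X}$. Let the sequence
$\{x_t, y_t, z_t\}$ be generated according to the updates (7a)-(7c), and define the error term $e_t = \nabla f_{\mu_t}(y_t) - g_t$.
Then for any $x^* \in \mathcal{X}$,

$$\frac{1}{\theta_t^2}\left[f_{\mu_t}(x_{t+1}) + \varphi(x_{t+1})\right] \le \sum_{\tau=0}^{t}\frac{1}{\theta_\tau}\left[f_{\mu_\tau}(x^*) + \varphi(x^*)\right] + \left(L_{t+1} + \frac{\eta_{t+1}}{\theta_{t+1}}\right)\psi(x^*) + \sum_{\tau=0}^{t}\frac{1}{2\theta_\tau\eta_\tau}\|e_\tau\|_*^2 + \sum_{\tau=0}^{t}\frac{1}{\theta_\tau}\langle e_\tau, z_\tau - x^*\rangle.$$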

See Appendix B for the proof of this claim. 

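Lemma 3. Let the sequence $\theta_t$ satisfy $\frac{1 - \theta_t}{\theta_t^2} = \frac{1}{\theta_{t-1}^2}$ and $\theta_0 = 1$. Then $\theta_t \le \frac{2}{t + 2}$, and $\sum_{\tau=0}^{t}\frac{1}{\theta_\tau} = \frac{1}{\theta_t^2}$.
The second statement was proved by Tseng [34]; the first follows by a straightforward induction.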
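Both claims of Lemma 3 can be checked numerically by generating the sequence from its defining recursion. In the sketch below (the function name is hypothetical), $\theta_t$ is obtained as the positive root of $\theta^2 + a\theta - a = 0$ with $a = \theta_{t-1}^2$, written in a form that avoids subtractive cancellation.

```python
import math


def theta_sequence(T):
    """Return [theta_0, ..., theta_T] with theta_0 = 1 and (1 - theta_t)/theta_t^2 = 1/theta_{t-1}^2."""
    thetas = [1.0]
    for _ in range(T):
        a = thetas[-1] ** 2
        # Positive root of theta^2 + a*theta - a = 0, in a numerically stable form.
        thetas.append(2.0 * a / (a + math.sqrt(a * a + 4.0 * a)))
    return thetas
```

For instance, $\theta_1 = (\sqrt{5} - 1)/2 \approx 0.618$, which is indeed at most $2/3$.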

We now proceed with the proof. Recalling $f_{\mu_t}(x) = \mathbb{E}[f(x + u_tZ)]$, let us verify that $f_{\mu_t}(x) \le
f_{\mu_{t-1}}(x)$ for any x and t so that we can apply Lemma 2. Since $u_t \le u_{t-1}$, we may define a random
variable $U \in \{0, 1\}$ such that $\mathbb{P}(U = 1) = \frac{u_t}{u_{t-1}} \in [0, 1]$. Then

$$f_{\mu_t}(x) = \mathbb{E}[f(x + u_tZ)] = \mathbb{E}\left[f(x + u_{t-1}Z\,\mathbb{E}[U])\right] \le \mathbb{P}[U = 1]\,\mathbb{E}[f(x + u_{t-1}Z)] + \mathbb{P}[U = 0]\,f(x),$$

where the inequality follows from Jensen's inequality. By a second application of Jensen's inequality,
we have $f(x) = f(x + u_{t-1}\mathbb{E}[Z]) \le \mathbb{E}[f(x + u_{t-1}Z)] = f_{\mu_{t-1}}(x)$. Combined with the previous
inequality, we conclude that $f_{\mu_t}(x) \le \mathbb{E}[f(x + u_{t-1}Z)] = f_{\mu_{t-1}}(x)$ as claimed. Consequently, we
have verified that the function $f_{\mu_t}$ satisfies the assumptions of Lemma 2 where $\nabla f_{\mu_t}$ has Lipschitz
parameter $L_t = L_1/u_t$ and error term $e_t = \nabla f_{\mu_t}(y_t) - g_t$. We apply the lemma momentarily.

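Using Assumption A that $f(x) \ge \mathbb{E}[f(x + u_tZ)] - L_0u_t = f_{\mu_t}(x) - L_0u_t$ for all $x \in \mathcal{X}$, Lemma 3
implies

$$\frac{1}{\theta_{T-1}^2}[f(x_T) + \varphi(x_T)] - \frac{1}{\theta_{T-1}^2}[f(x^*) + \varphi(x^*)]$$
$$= \frac{1}{\theta_{T-1}^2}[f(x_T) + \varphi(x_T)] - \sum_{t=0}^{T-1}\frac{1}{\theta_t}[f(x^*) + \varphi(x^*)]$$
$$\le \frac{1}{\theta_{T-1}^2}[f_{\mu_{T-1}}(x_T) + \varphi(x_T)] - \sum_{t=0}^{T-1}\frac{1}{\theta_t}[f_{\mu_t}(x^*) + \varphi(x^*)] + \sum_{t=0}^{T-1}\frac{L_0u_t}{\theta_t},$$

which by the definition of $u_t$ is in turn bounded by

$$\frac{1}{\theta_{T-1}^2}[f_{\mu_{T-1}}(x_T) + \varphi(x_T)] - \sum_{t=0}^{T-1}\frac{1}{\theta_t}[f_{\mu_t}(x^*) + \varphi(x^*)] + TL_0u. \tag{18}$$

Now we simply apply Lemma 2 to the bound (18), which gives us

$$\frac{1}{\theta_{T-1}^2}[f(x_T) + \varphi(x_T) - f(x^*) - \varphi(x^*)] \le L_T\psi(x^*) + \frac{\eta_T}{\theta_T}\psi(x^*) + \sum_{t=0}^{T-1}\frac{1}{2\theta_t\eta_t}\|e_t\|_*^2 + \sum_{t=0}^{T-1}\frac{1}{\theta_t}\langle e_t, z_t - x^*\rangle + TL_0u. \tag{19}$$

The non-probabilistic bound (19) is the key to the remainder of this proof, as well as the starting
point for the proof of Theorem 2 in the next section. What remains is to take expectations in the
bound (19).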

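Recall the filtration of $\sigma$-fields $\mathcal{F}_t$ so that $x_t, y_t, z_t \in \mathcal{F}_{t-1}$, that is, $\mathcal{F}_t$ contains the randomness
in the stochastic oracle to time t. Since $g_t$ is an unbiased estimator of $\nabla f_{\mu_t}(y_t)$ by construction, we
have $\mathbb{E}[g_t \mid \mathcal{F}_{t-1}] = \nabla f_{\mu_t}(y_t)$ and

$$\mathbb{E}[\langle e_t, z_t - x^*\rangle] = \mathbb{E}\left[\mathbb{E}[\langle e_t, z_t - x^*\rangle \mid \mathcal{F}_{t-1}]\right] = \mathbb{E}\left[\langle \mathbb{E}[e_t \mid \mathcal{F}_{t-1}], z_t - x^*\rangle\right] = 0,$$

where we have used the fact that $z_t$ is measurable with respect to $\mathcal{F}_{t-1}$. Now, recall from Lemma 3
that $\theta_t \le \frac{2}{2 + t}$ and that $(1 - \theta_t)/\theta_t^2 = 1/\theta_{t-1}^2$. Thus

$$\frac{\theta_{t-1}^2}{\theta_t^2} = \frac{1}{1 - \theta_t} \le \frac{1}{1 - \frac{2}{2 + t}} = \frac{2 + t}{t} \le \frac{3}{2} \quad \text{for } t \ge 4.$$

Furthermore, we have $\theta_{t+1} \le \theta_t$, so by multiplying both sides of our bound (19) by $\theta_{T-1}^2$ and taking
expectations over the random vectors $g_t$,

$$\mathbb{E}[f(x_T) + \varphi(x_T)] - [f(x^*) + \varphi(x^*)]$$
$$\le \theta_{T-1}^2L_T\psi(x^*) + \theta_{T-1}^2\frac{\eta_T}{\theta_T}\psi(x^*) + \theta_{T-1}^2\sum_{t=0}^{T-1}\frac{1}{2\theta_t\eta_t}\mathbb{E}\|e_t\|_*^2 + \theta_{T-1}^2\sum_{t=0}^{T-1}\frac{1}{\theta_t}\mathbb{E}[\langle e_t, z_t - x^*\rangle] + \theta_{T-1}^2TL_0u$$
$$\le \frac{6L_1\psi(x^*)}{Tu} + \frac{2\eta_T\psi(x^*)}{T} + \frac{1}{T}\sum_{t=0}^{T-1}\frac{1}{\eta_t}\mathbb{E}\|e_t\|_*^2 + \frac{4L_0u}{T},$$

where we used the fact that $L_t = L_1/u_t = L_1/(\theta_tu)$. This completes the proof of Theorem 1.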

As promised, we conclude with a theorem using a fixed setting of the smoothing parameter $u_t$.

Theorem 3. Suppose that $u_t \equiv u$ for all t and set $L_t \equiv L_1/u$. With the remaining conditions as
in Theorem 1, then for any $x^* \in \mathcal{X}$, we have

$$\mathbb{E}[f(x_T) + \varphi(x_T)] - [f(x^*) + \varphi(x^*)] \le \frac{4L_1\psi(x^*)}{T^2u} + \frac{2\eta_T\psi(x^*)}{T} + \frac{1}{T}\sum_{t=0}^{T-1}\frac{1}{\eta_t}\mathbb{E}\left[\|e_t\|_*^2\right] + L_0u,$$

where $e_t := \nabla f_\mu(y_t) - g_t$.

Proof. The proof is brief. If we fix $u_t \equiv u$ for all t, then the bound (19) holds with the last term
$TL_0u$ replaced by $\theta_{T-1}^{-2}L_0u$, which we see by invoking Lemma 3. The remainder of the proof follows
unchanged, with $L_t \equiv L_1/u$ for all t. □

It is clear that by setting $u \propto 1/T$, the rates achieved by Theorem 1 and Theorem 3 are identical
up to constant factors.
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4.5 Proof of Theorem 2 

An examination of the proof of Theorem 1 shows that to control the probability of deviation
from the expected convergence rate, we need to control two terms: the squared error sequence
$\sum_{t=0}^{T-1}\frac{1}{2\eta_t}\|e_t\|_*^2$ and the sequence $\sum_{t=0}^{T-1}\frac{1}{\theta_t}\langle e_t, z_t - x^*\rangle$. The next two lemmas handle these terms.

Lemma 4. Let $\mathcal{X}$ be compact with $\|x - x^*\| \le R$ for all $x \in \mathcal{X}$. Under Assumption B, we have



Consequently, with probability at least 1 — 5 



i—n ' J \ / 



(20) 



1 1.. 1 



t=o * ' 
Lemma 5. In the notation of Theorem 2 and under Assumption B, we have 

log I 



T-l , T-l , -1^2 



t=o " t=o " 

Consequently, with probability at least 1 — 5, 
T-l , T-l 



<max< 1 1 ; ~~r^e r- (22) 



See Appendices C and D, respectively, for the proofs of these two lemmas. 



(23) 



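Equipped with these lemmas, we now prove Theorem 2. Let us recall the deterministic bound (19)
from the proof of Theorem 1:

$$\frac{1}{\theta_{T-1}^2}[f(x_T) + \varphi(x_T) - f(x^*) - \varphi(x^*)] \le L_T\psi(x^*) + \frac{\eta_T}{\theta_T}\psi(x^*) + \sum_{t=0}^{T-1}\frac{1}{2\theta_t\eta_t}\|e_t\|_*^2 + \sum_{t=0}^{T-1}\frac{1}{\theta_t}\langle e_t, z_t - x^*\rangle + TL_0u.$$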



Noting that < Ot for t G {0, . . . , T — 1}, Lemma 5 implies that with probability at least 1 — 5 

^^-1 E ^ 11^*11* ^ E T^IE[||et||2] + — max |log(l/<5), 72e(logr + 1) log(l/5)| . 
Applying Lemma 4, we see that with probability at least \ — 5 



The terms remaining to control are deterministic, and were bounded previously in the proof of 
Theorem 1; in particular, we have 



2 J . 6L1 Q\^ir]T ^ 47?r ^ 
i2'_iLx < , — < , and ^t-i^ S 



ALqu 

Combining the above bounds completes the proof. 
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5 Discussion 



In this paper, we have analyzed smoothing strategies for stochastic non-smooth optimization. We 
have developed methods that are provably optimal in the stochastic oracle model of optimization 
complexity, and given — to our knowledge — the first variance reduction techniques for non-smooth 
stochastic optimization. We think that at least two obvious questions remain. The first, to which 
we have alluded earlier, is whether the randomized smoothing is necessary to achieve such optimal 
rates of convergence. The second question is whether dimension-independent smoothing techniques
are possible, that is, whether the d-dependent factors in the bounds in Corollaries 1-4 are necessary.
Answering this question would require study of different smoothing distributions, as the
dimension dependence for our choices of $\mu$ is tight. We have outlined several applications for which
smoothing techniques give provable improvement over standard methods. Our experiments also 
show qualitatively good agreement with the theoretical predictions we have developed. 
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A Proof of Lemma 1 

The proof of this lemma requires several auxiliary results on sub-Gaussian and sub-exponential 
random variables, which we collect and prove in Appendix F. 

Define $X_i = \nabla f_\mu(x_t) - g_{i,t}$ and $S_m = \sum_{i=1}^m X_i$, so $\frac{1}{m}S_m = \nabla f_\mu(x_t) - \frac{1}{m}\sum_{i=1}^m g_{i,t}$. Note that
conditioned on $\mathcal{F}_{t-1}$, the $X_i$ are independent; moreover, for $L = 2L_0$, we have $\|X_i\|_* \le L$, so we can
apply Lemma 16 from Appendix F. In particular, we see that $\|\frac{1}{m}S_m\|_* - \mathbb{E}\|\frac{1}{m}S_m\|_*$ is sub-Gaussian
with parameter at most $4L^2/m$. Consequently, we can apply Lemma 13 from Appendix F so as to
obtain

^ [ ■'^rn\\^U\l \ ^ 1 fmimS^l? s \ 

^ exp < ^- exp 



8L2 y-yr^^^^^\^ 8L2 l-s^ 

Moreover, we can weaken the sub-Gaussian parameter AL? /m with max{E ^AL? /m}: 

( A2 max{4LV"^:IE||iS^||^" 
E [exp(A(||5^AnL - E ||5^/m||J)] < exp ^ , " 
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Recalling that for any random variable X, Jensen's inequality gives (EX) < EX , we have 

' s\\hSj\l \ _ I ( ^WhSmt S \ 



^exp — — , _ < exp 



2max{E||^S„||^AL2}; V2max{E ||i,S„,||^ ^^2} 

1 /I s 

< —== exp 



\2 1-s 
By taking s = ^, we get 



L=exp Q ^) = v/2exp Q) < exp(l). 



Replacing L with 2Lq completes the proof. 

B Proof of Lemma 2 

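Define the linearized version of the cumulative objective

$$\ell_t(z) := \sum_{\tau=0}^{t}\frac{1}{\theta_\tau}\left[f_{\mu_\tau}(y_\tau) + \langle g_\tau, z - y_\tau\rangle + \varphi(z)\right], \tag{24}$$

and let $\ell_{-1}(z)$ denote the indicator function of the set $\mathcal{X}$. For conciseness, we adopt the shorthand

$$\alpha_t^{-1} = L_t + \eta_t/\theta_t \quad \text{and} \quad \phi_t(x) = f_{\mu_t}(x) + \varphi(x).$$

By the smoothness of $f_{\mu_t}$, we have

$$f_{\mu_t}(x_{t+1}) + \varphi(x_{t+1}) \le f_{\mu_t}(y_t) + \langle \nabla f_{\mu_t}(y_t), x_{t+1} - y_t\rangle + \frac{L_t}{2}\|x_{t+1} - y_t\|^2 + \varphi(x_{t+1}).$$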
From the definition (7a)-(7c) of the triple {xt,yt, zt), we obtain 

Mxt+i) < UM) + {VU,iyt),dtZt+i + (1 - et)xt) + y \\etzt+i - Btztf + ^(Btzt^i + (1 - Qt)xt). 
Finally, by convexity of the regularizer y?, we conclude 



ff^tivt) + f^it{yt),zt+i - yt) + ^{zt+i) + \\zt+i - zt"'^ 

+ (1 - et)[U,{yt) + {Vf^,{yt),xt - yt) + ^{xt)]. (25) 
By the strong convexity of V') it is clear that we have the lower bound 

1 2 

D^{x,y) = ipix) -ipiy) - {V'il){y),x-y) > - \\x - y\\ . (26) 
On the other hand, by the convexity of /^^ , we have 

UAyt) + {yu^(.yt),xt - yt) < UM)- (27) 
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Substituting inequalities (26) and (27) into the upper bound (25) and simplifying yields 

4>t{xt+i) < Ot [ftitivt) + (V/^,(yt),2;t+i - yt) +if{zt+i) + LtOtD^izt+i, zt)] + (1 - Ot)[ff,,{xt) + ip{xt)]. 
We now re-write this upper bound in terms of the error et = ffniut) — dt- In particular, 

< Ot iftttiyt) + {gt, zt+i - yt) + ^{zt+i) + Lt9tD^{zt+i,zt)] 

+ (1 - et)[ff,,{xt) + 'pixt)] + Ot {et,zt+i - yt) 
= el [it{zt+i) - it-i{zt+i) + LtD^{zt+i,zt)] + (1 - et)[ff,,{xt) + ifixt)] + Ot {et,zt+i - yt) . (28) 

Using the fact that zt minimizes it-i{x) + ■^^p{x) , the first order conditions for optimality imply 
that for all g G d£t-i{zt), we have [g + ■^'V'il){zt),x — zt) > 0. Thus, first-order convexity gives 



It-iix) -lt-i{zt) > {g,x- Zt) > — - {V'ilj{zt),x - zt) = —ipizt) - -i^ix) + —D^{x,zt). 

at at a at 

Adding £t{zt+i) to both sides of the above and substituting x = zt+i, we conclude 

£t{zt+i) - lt-i{zt+i) < £t{zt+i) - it-i{zt) - —i^{zt) + —i^{zt+i) - —D^{zt+i,zt). 

at at at 

Combining this inequality with the bound (28) and using the definition a^^ = Lt + rjt/Ot, we find 

f^,t{xt+i) + <p{xt+i) < el 



itizt+i) - it{zt) - —i^{zt) + —^{zt+i) - '^Dj,{zt+i,zt) 
at at et 



<el 



+ (1 - et)[fi,, {xt) + ^{xt)] + et {et, zt+i - yt) 

lt{zt+i) - it{zt) - —i^izt) H —i/jizt+i) - '^D^{zt+i,zt) 

at at+i et 

+ (1 - et) iU, {xt) + ifixt)] + et {et, zt+i - yt) 

since a^^ is non-decreasing. We now divide both sides by e^, and unwrap the recursion. Recall 
that (1 — et)/e1 = l/e^^i and < f^t_^ by construction, so we obtain 

72 [/Mt(^t+i) + "fi^t+i)] < ^-ET^iftitixt) + p{xt)] + it{zt+i) - £t{zt) - —i^izt) + -^tp{zt+i) 
Of t)^ at at+i 

- '!^D^{zt+i,zt) + ^ {et, zt+i - yt) 

tit 

- 72~[-/'/^t(^*) + '^(^t)] + ^t{zt+i) - it{zt) ip{zt) H il'izt+i) 

Of^i at at+i 

- ^D^{zt+i,zt) + ^ {et, zt+i - yt) 

tit 

(") 1 11 

^ ■E2—ifut~lixt) + P{xt)]+it{zt+l)-it{zt) Tp{zt) + llj{zt+l) 

Ot_i at at+i 

- '^D^{zt+i,zt) + ^ {et, zt+i - yt) ■ 
t^t t/t 
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The equality (i) follows since (1 — Ot)/Ol = l/O^^^, while the inequality (ii) is a consequence of the 
fact that < /^j„i. By applying the three steps above successively to [fi_it_i{xt) + (p{xt)]/6i_i, 
then to [//^t_2 (^t-i) + ^i^t-i)]/Si_2, and so on until t = 0, we find 

t t 



r=0 ' T=0 ' 

By construction, = I5 we have £_i(zo) = 0, and z^+i minimizes lt{x) + -^^^^'il:{x) over X. 
Thus, for any x* € Af, we have 

Recalling the definition (24) of £t and noting that /^^ (yt) + ( V/^^ {yt),x — yt) < f^t (x) by convexity, 
we expand and have 

* 1 1 * * 1 

- 5Z + ^* ~ + '^(^*)] + ~ X] i-D^{Zr+l,Zr) + X ^^+1 ~ 



< E + ^(^*)] + ^^(^*) - E + E ^ ^-+1 - ■ (29) 

T=0 ^ r=0 r=0 

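Now we use the Fenchel-Young inequality applied to the conjugates $\frac{1}{2}\|\cdot\|^2$ and $\frac{1}{2}\|\cdot\|_*^2$, which gives

$$\langle e_t, z_{t+1} - x^*\rangle = \langle e_t, z_t - x^*\rangle + \langle e_t, z_{t+1} - z_t\rangle \le \langle e_t, z_t - x^*\rangle + \frac{1}{2\eta_t}\|e_t\|_*^2 + \frac{\eta_t}{2}\|z_t - z_{t+1}\|^2.$$

In particular, since $D_\psi(z_{t+1}, z_t) \ge \frac{1}{2}\|z_{t+1} - z_t\|^2$ by the bound (26),

$$-\frac{\eta_t}{\theta_t}D_\psi(z_{t+1}, z_t) + \frac{1}{\theta_t}\langle e_t, z_{t+1} - x^*\rangle \le \frac{1}{2\eta_t\theta_t}\|e_t\|_*^2 + \frac{1}{\theta_t}\langle e_t, z_t - x^*\rangle.$$

Using this inequality and rearranging (29) gives the statement of the lemma.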



C Proof of Lemma 4 



Consider the sequence Ylt=o 'ff^ i^t, zt 



f). Since X is compact and — x* 
{et,zt - X*) < \\et\l R. Further, E[{et,zt - x*) \ Tt^i] = 0, so that ^ {et,zt - x' 
difference sequence. Further, by setting q = Ra/Ot, we have 



< R, we have 
is a martingale 



E 



exp 



' {et,zt 



t-i 



< E 



exp 



\et\\iR^ 



E 



exp 



< exp(l) 
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by Assumption B. By applying Lemma 18 from Appendix F, we conclude that ^ {et,zt — x*) is 
(conditionally) sub-Gaussian with parameter < AR'^ a"^ / 39^ . Applying the Azuma-Hoeffding 
inequality (see Eq. (41), Appendix F) yields 



w 



t=0 



< exp 



3w' 



Setting w = ejQT-x yields that 



T-i ^ 



< 



exp 



3e2 



Noting that Qt~\ < Ot for any t < T, we have ijV^ ^^Jq^ ^ < R'^a^ YlJ=o 1 = ^^^2", dividing 
e again by 0t-1i and recalling that 9t-i < jqrx' have 



T-l 



2:* - X*) > e 



i=0 



< 



exp 



12(T+ l)e2 



< 



exp 



3re2 
"2flV2 / ' 



as claimed (20). The second claim (21) follows by setting 6 = exp(— ^^^), and then solving to 

,2 _ 2flV2 1 1 
ST 5 • 



obtain e 



D Proof of Lemma 5 

Again, recall the a-fields J-'t defined prior to Assumption B. Define the random variables 

Xt:=:^\\et\\l-^E[\\et\\l\Tt-i]. 
2??t 2i]t 

As an intermediate step, we claim that for A < r]t/2a'^, the following bound holds: 
E[exp(AXt) I = E 



exp( A(||e,||2_E[||e,|L2| J-,_i])) I 



< exp ( ^AV^ ) . (30) 



For now, we proceed with the proof, returning to establish this intermediate claim later. 

The bound (30) implies that Xt is sub-exponential with parameters At = r]t/2a^ and < Idea'^/rj^. 
Since rjt = i]\/t + 1, it is clear that mmt{At} = Aq = r]o/2a'^. By defining = Ylt=o '^ti ^^"^ 
apply Theorem 1.5.1 from the book [7, pg. 26] to conclude that 



'Tx,>^<hn-^) f-»£'£A„c^ (31) 

/ [exp(-^) otherwise, i.e. e > AqC^, 



which yields the first claim in Lemma 5. 

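The second statement involves inverting the bound for the different regimes of $\epsilon$. Before proving
the bound, we note that for $\epsilon = \Lambda_0 C^2$, we have $\exp(-\epsilon^2/2C^2) = \exp(-\Lambda_0\epsilon/2)$, so we can invert
each of the $\exp$ terms to solve for $\epsilon$ and take the maximum of the bounds. We begin with $\epsilon$ in the
regime closest to zero, recalling that $\eta_t = \eta\sqrt{t+1}$. We see that

$$C^2 \le \frac{16e\sigma^4}{\eta^2}\sum_{t=0}^{T-1}\frac{1}{t+1} \le \frac{16e\sigma^4}{\eta^2}(\log T + 1).$$

Thus, inverting the bound $\delta = \exp(-\epsilon^2/2C^2)$, we get $\epsilon = \sqrt{2C^2\log\frac{1}{\delta}}$, or that

$$\sum_{t=0}^{T-1}\frac{1}{2\eta_t}\|e_t\|_*^2 \le \sum_{t=0}^{T-1}\frac{1}{2\eta_t}\mathbb{E}\big[\|e_t\|_*^2 \mid \mathcal{F}_{t-1}\big] + \frac{4\sigma^2}{\eta}\sqrt{2e(\log T + 1)\log\frac{1}{\delta}}$$

with probability at least $1 - \delta$. In the large $\epsilon$ regime, we solve $\delta = \exp(-\eta\epsilon/4\sigma^2)$ or $\epsilon = \frac{4\sigma^2}{\eta}\log\frac{1}{\delta}$,
which gives that

$$\sum_{t=0}^{T-1}\frac{1}{2\eta_t}\|e_t\|_*^2 \le \sum_{t=0}^{T-1}\frac{1}{2\eta_t}\mathbb{E}\big[\|e_t\|_*^2 \mid \mathcal{F}_{t-1}\big] + \frac{4\sigma^2}{\eta}\log\frac{1}{\delta}$$

with probability at least $1 - \delta$, by the bound (31).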
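The inversion of the two regimes can be sketched numerically. The snippet below is a minimal illustration, assuming the bound $C^2 \le 16e\sigma^4(\log T + 1)/\eta^2$ and the stepsize form $\eta_t = \eta\sqrt{t+1}$ used in this proof; the function name `lemma5_deviation` and its arguments are hypothetical and chosen only for this example.

```python
import math

def lemma5_deviation(sigma, eta, T, delta):
    """Deviation term obtained by inverting both regimes of the tail bound (31).

    Hypothetical helper: returns the larger of the two inverted bounds.
    """
    log_term = math.log(1.0 / delta)
    # Small-epsilon regime: epsilon = sqrt(2 C^2 log(1/delta)),
    # with C^2 bounded by 16 e sigma^4 (log T + 1) / eta^2.
    c_sq = 16.0 * math.e * sigma**4 * (math.log(T) + 1.0) / eta**2
    eps_small = math.sqrt(2.0 * c_sq * log_term)
    # Large-epsilon regime: epsilon = (4 sigma^2 / eta) log(1/delta).
    eps_large = 4.0 * sigma**2 / eta * log_term
    return max(eps_small, eps_large)
```

For moderate confidence levels the square-root term dominates; the $\log(1/\delta)$ term takes over only once $\log(1/\delta)$ exceeds $2e(\log T + 1)$.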

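We now return to prove the intermediate claim (30). Let $X := \|e_t\|_*$. By assumption, we have
$\mathbb{E}\exp(X^2/\sigma^2) \le \exp(1)$, so for $\lambda \in [0, 1]$ we see

$$\mathbb{P}(X^2/\sigma^2 > \epsilon) \le \mathbb{E}[\exp(\lambda X^2/\sigma^2)]\exp(-\lambda\epsilon) \le \exp(\lambda - \lambda\epsilon)$$

and replacing $\epsilon$ with $1 + \epsilon$ we have $\mathbb{P}(X^2 > \sigma^2 + \epsilon\sigma^2) \le \exp(-\epsilon)$. If $\epsilon\sigma^2 \ge \sigma^2 - \mathbb{E}X^2$, then
$\sigma^2 - \mathbb{E}X^2 + \epsilon\sigma^2 \le 2\epsilon\sigma^2$, so

$$\mathbb{P}(X^2 > \mathbb{E}X^2 + 2\epsilon\sigma^2) \le \mathbb{P}(X^2 > \sigma^2 + \epsilon\sigma^2) \le \exp(-\epsilon),$$

while for $\epsilon\sigma^2 < \sigma^2 - \mathbb{E}X^2$, we clearly have $\mathbb{P}(X^2 - \mathbb{E}X^2 > \epsilon\sigma^2) \le 1 \le \exp(1)\exp(-\epsilon)$ since $\epsilon \le 1$.
In either case, we have

$$\mathbb{P}(X^2 - \mathbb{E}X^2 > \epsilon) \le \exp(1)\exp\Big(-\frac{\epsilon}{2\sigma^2}\Big).$$

For the opposite concentration inequality, we see that

$$\mathbb{P}((\mathbb{E}X^2 - X^2)/\sigma^2 > \epsilon) \le \mathbb{E}[\exp(\lambda\mathbb{E}X^2/\sigma^2)\exp(-\lambda X^2/\sigma^2)]\exp(-\lambda\epsilon) \le \exp(\lambda - \lambda\epsilon)$$

or $\mathbb{P}(X^2 - \mathbb{E}X^2 < -\sigma^2\epsilon) \le \exp(1)\exp(-\epsilon)$. Using the union bound, we have

$$\mathbb{P}(|X^2 - \mathbb{E}X^2| > \epsilon) \le 2\exp(1)\exp\Big(-\frac{\epsilon}{2\sigma^2}\Big). \tag{32}$$

Now we apply Lemma 17 to the bound (32) to see that $\|e_t\|_*^2 - \mathbb{E}[\|e_t\|_*^2] = X^2 - \mathbb{E}[X^2]$ is
sub-exponential with parameters $\Lambda \ge 1/(4\sigma^2)$ and $\tau^2 \le 32e\sigma^4$.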
E Properties of randomized smoothing

In this section, we discuss the analytic properties of the smoothed function $f_\mu$ from the convolution (3).
We assume throughout that functions are sufficiently integrable without bothering with
measurability conditions (since $F(\cdot;\xi)$ is convex, this is no real loss of generality [4, 31]). By Fubini's
theorem, we have

$$f_\mu(x) = \int_{\mathbb{R}^d}\int_\Xi F(x+y;\xi)\,dP(\xi)\,\mu(y)\,dy = \int_\Xi\int_{\mathbb{R}^d} F(x+y;\xi)\,\mu(y)\,dy\,dP(\xi) = \int_\Xi F_\mu(x;\xi)\,dP(\xi).$$

Here $F_\mu(x;\xi) = (F(\cdot;\xi) * \mu(-\cdot))(x)$. We begin with the observation that since $\mu$ is a density with
respect to Lebesgue measure, the function $f_\mu$ is in fact differentiable [4]. So we have already made
our problem somewhat smoother, as it is now differentiable; for the remainder, we consider finer
properties of the smoothing operation. In particular, we will show that under suitable conditions
on $\mu$, $F(\cdot;\xi)$, and $P$, the function $f_\mu$ is uniformly close to $f$ over $\mathcal{X}$ and $\nabla f_\mu$ is Lipschitz continuous.

E.1 Statements of smoothing lemmas

A remark on notation before proceeding: since $f$ is convex, it is almost-everywhere differentiable,
and we can abuse notation and take its gradient inside of integrals and expectations with respect to
Lebesgue measure. Similarly, $F(\cdot;\xi)$ is almost everywhere differentiable with respect to Lebesgue
measure, so we use the same abuse of notation for $F$ and write $\nabla F(x + Z;\xi)$, which exists with
probability 1. We prove the following set of smoothness lemmas at the end of this section.

Lemma 6. Let $\mu$ be the uniform density on the $\ell_\infty$-ball of radius $u$. Assume that $\mathbb{E}[\|\partial F(x;\xi)\|_\infty^2] \le L_0^2$
for all $x \in \mathcal{X} + B_\infty(0, u)$. Then

(i) $f(x) \le f_\mu(x) \le f(x) + \frac{L_0 d}{2}u$.

(ii) $f_\mu$ is $L_0$-Lipschitz with respect to the $\ell_1$-norm over $\mathcal{X}$.

(iii) $f_\mu$ is continuously differentiable; moreover, its gradient is $\frac{L_0}{u}$-Lipschitz continuous with respect
to the $\ell_1$-norm.

(iv) Let $Z \sim \mu$. Then $\mathbb{E}[\nabla F(x + Z;\xi)] = \nabla f_\mu(x)$ and $\mathbb{E}[\|\nabla f_\mu(x) - \nabla F(x + Z;\xi)\|_\infty^2] \le 4L_0^2$.

There exists a function $f$ for which each of the estimates (i)-(iii) is tight simultaneously, and (iv)
is tight at least to a factor of $1/4$.

Remark: Note that the hypothesis of this lemma is satisfied if for any fixed $\xi \in \Xi$, the function
$F(\cdot;\xi)$ is $L_0$-Lipschitz with respect to the $\ell_1$-norm.

The following lemma provides bounds for uniform smoothing of functions Lipschitz with respect
to the $\ell_2$-norm while sampling from an $\ell_\infty$-ball.

Lemma 7. Let $\mu$ be the uniform density on $B_\infty(0,u)$ and assume that $\mathbb{E}[\|\partial F(x;\xi)\|_2^2] \le L_0^2$ for
$x \in \mathcal{X} + B_\infty(0,u)$. Then

(i) The function $f$ satisfies the upper bound $f(x) \le f_\mu(x) \le f(x) + L_0 u\sqrt{d}$.

(ii) The function $f_\mu$ is $L_0$-Lipschitz over $\mathcal{X}$.

(iii) The function $f_\mu$ is continuously differentiable; moreover, its gradient is Lipschitz continuous.

(iv) For random variables $Z \sim \mu$ and $\xi \sim P$, we have

$$\mathbb{E}[\nabla F(x + Z;\xi)] = \nabla f_\mu(x), \quad\text{and}\quad \mathbb{E}[\|\nabla f_\mu(x) - \nabla F(x + Z;\xi)\|_2^2] \le L_0^2.$$

The latter estimate is tight.

A similar lemma can be proved when $\mu$ is the density of the uniform distribution on $B_2(0,u)$.
In this case, Yousefian et al. give (i)-(iii) of the following lemma [38].

Lemma 8 (Yousefian, Nedic, Shanbhag). Let $f_\mu$ be defined as in (3) where $\mu$ is the uniform density
on the $\ell_2$-ball of radius $u$. Assume that $\mathbb{E}[\|\partial F(x;\xi)\|_2^2] \le L_0^2$ for $x \in \mathcal{X} + B_2(0,u)$. Then

(i) $f(x) \le f_\mu(x) \le f(x) + L_0 u$.

(ii) $f_\mu$ is $L_0$-Lipschitz over $\mathcal{X}$.

(iii) $f_\mu$ is continuously differentiable; moreover, its gradient is $\frac{L_0\sqrt{d}}{u}$-Lipschitz continuous.

(iv) Let $Z \sim \mu$. Then $\mathbb{E}[\nabla F(x + Z;\xi)] = \nabla f_\mu(x)$, and $\mathbb{E}[\|\nabla f_\mu(x) - \nabla F(x + Z;\xi)\|_2^2] \le L_0^2$.

In addition, there exists a function $f$ for which each of the bounds (i)-(iv) is tight simultaneously, in the
sense that none can be improved by more than a constant factor.

Lastly, for situations in which $F(\cdot;\xi)$ is $L_0$-Lipschitz with respect to the $\ell_2$-norm over all of $\mathbb{R}^d$
and for $P$-a.e. $\xi$, we can use the normal distribution to perform smoothing of the expected function
$f$. The following lemma is similar to a result of Lakshmanan and de Farias [17, Lemma 3.3],
but they consider functions Lipschitz-continuous with respect to the $\ell_\infty$-norm, i.e. $|f(x) - f(y)| \le L\|x - y\|_\infty$,
which is too stringent for our purposes, and we carefully quantify the dependence on
the dimension of the underlying problem.

Lemma 9. Let $\mu$ be the $N(0, u^2 I_{d\times d})$ distribution. Assume that $F(\cdot;\xi)$ is $L_0$-Lipschitz with respect
to the $\ell_2$-norm, that is,

$$\sup\{\|g\|_2 \mid g \in \partial F(x;\xi),\ x \in \mathcal{X}\} \le L_0 \quad\text{for } P\text{-a.e. } \xi.$$

Then the following properties hold:

(i) $f(x) \le f_\mu(x) \le f(x) + L_0 u\sqrt{d}$.

(ii) $f_\mu$ is $L_0$-Lipschitz with respect to the $\ell_2$-norm.

(iii) $f_\mu$ is continuously differentiable; moreover, its gradient is $\frac{L_0}{u}$-Lipschitz continuous with respect
to the $\ell_2$-norm.

(iv) Let $Z \sim \mu$. Then $\mathbb{E}[\nabla F(x + Z;\xi)] = \nabla f_\mu(x)$, and $\mathbb{E}[\|\nabla f_\mu(x) - \nabla F(x + Z;\xi)\|_2^2] \le L_0^2$.

In addition, there exists a function $f$ for which each of the bounds (i)-(iv) cannot be improved by
more than a constant factor.
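To make the quantities in Lemma 9 concrete, the following minimal sketch estimates $f_\mu(x)$ and $\nabla f_\mu(x)$ by Monte Carlo for the test function $f(x) = \|x\|_2$ (so $L_0 = 1$) under Gaussian smoothing. The function names, the sample size, and the seed are illustrative assumptions, not part of the procedures analyzed in the paper.

```python
import math
import random

def l2_norm(v):
    return math.sqrt(sum(vi * vi for vi in v))

def subgradient_l2(x):
    """A subgradient of f(x) = ||x||_2 (the gradient away from the origin)."""
    n = l2_norm(x)
    if n == 0.0:
        return [0.0] * len(x)  # 0 is a valid subgradient at the origin
    return [xi / n for xi in x]

def gaussian_smoothing_estimates(x, u, m, seed=0):
    """Monte Carlo estimates of f_mu(x), grad f_mu(x), and the gradient spread.

    Z is drawn from N(0, u^2 I); the spread is the empirical mean of
    ||g_i - mean(g)||_2^2 over the sampled gradients g_i.
    """
    rng = random.Random(seed)
    d = len(x)
    f_sum = 0.0
    grads = []
    for _ in range(m):
        pt = [xi + rng.gauss(0.0, u) for xi in x]
        f_sum += l2_norm(pt)
        grads.append(subgradient_l2(pt))
    f_mu = f_sum / m
    grad_mu = [sum(g[j] for g in grads) / m for j in range(d)]
    spread = sum(
        sum((g[j] - grad_mu[j]) ** 2 for j in range(d)) for g in grads
    ) / m
    return f_mu, grad_mu, spread
```

With this test function every sampled gradient has unit norm, so the empirical spread equals one minus the squared norm of the averaged gradient and never exceeds $L_0^2 = 1$, mirroring part (iv).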

Our final lemma illustrates the sharpness of the bounds we have proved for functions that are
Lipschitz with respect to the $\ell_2$-norm. Specifically, we show that at least for the normal and uniform
distributions, it is impossible to get more favorable tradeoffs between the uniform approximation
error of the smoothed function and the Lipschitz continuity of $\nabla f_\mu$. We begin with the following
definition of our two types of error, then give the lemma:

$$E_U(f) := \inf\Big\{L \in \mathbb{R} \;\Big|\; \sup_{x \in \mathcal{X}} |f(x) - f_\mu(x)| \le L\Big\} \tag{33}$$

$$E_\nabla(f) := \inf\big\{L \in \mathbb{R} \;\big|\; \|\nabla f_\mu(x) - \nabla f_\mu(y)\|_2 \le L\|x - y\|_2 \ \ \forall\, x, y \in \mathcal{X}\big\} \tag{34}$$

Lemma 10. For $\mu$ equal to either the uniform distribution on $B_2(0,u)$ or $N(0, u^2 I_{d\times d})$, there exists
an $L_0$-Lipschitz continuous function $f$ and dimension independent constant $c > 0$ such that

$$E_U(f)\,E_\nabla(f) \ge cL_0^2\sqrt{d}.$$

Remark: The importance of the above bound is made clear by inspecting the convergence
guarantee of Theorem 1. The terms $L_1$ and $L_0$ in the bound (8) can be replaced with $E_\nabla(f)$
and $E_U(f)$, respectively. Minimizing over $u$, we see that the leading term in the convergence
guarantee (8) is of order $\frac{\sqrt{E_\nabla(f)E_U(f)\psi(x^*)}}{T} \ge \frac{cL_0 d^{1/4}\sqrt{\psi(x^*)}}{T}$. In particular, this result shows that
our analysis of the dimension dependence of the randomized smoothing in Lemmas 8 and 9 is sharp
and cannot be improved by more than a constant factor (see also Corollaries 1 and 2).



E.2 Proof of smoothing lemmas

The following technical lemma is a building block for our results; we provide a proof in Sec. E.2.5.

Lemma 11. Let $f$ be convex and $L_0$-Lipschitz continuous with respect to a norm $\|\cdot\|$ over the
domain $\operatorname{supp}\mu + \mathcal{X}$. Let $Z$ be distributed according to the distribution $\mu$. Then

$$\|\nabla f_\mu(x) - \nabla f_\mu(y)\|_* = \|\mathbb{E}[\nabla f(x + Z) - \nabla f(y + Z)]\|_* \le L_0\int|\mu(z - x) - \mu(z - y)|\,dz. \tag{35}$$

If the norm $\|\cdot\|$ is the $\ell_2$-norm and the density $\mu(z)$ is rotationally symmetric and non-increasing
as a function of $\|z\|_2$, the bound (35) is tight; specifically, it is attained by the function

$$f(x) = L_0\Big|\Big\langle \frac{y}{\|y\|_2}, x\Big\rangle - \frac{1}{2}\Big|.$$

E.2.1 Proof of Lemma 6

Throughout, we let $Z \sim \mu$, where $\mu$ is the uniform density on $B_\infty(0,u)$, and $h_u(x)$ denote the
(shifted) Huber loss

$$h_u(x) = \begin{cases} \frac{x^2}{2u} + \frac{u}{2} & \text{for } x \in [-u, u], \\ |x| & \text{otherwise.} \end{cases} \tag{36}$$

Now we prove each of the parts of the lemma in turn.

(i) Since $\mathbb{E}[Z] = 0$, Jensen's inequality shows $f(x) = f(x + \mathbb{E}[Z]) \le \mathbb{E}[f(x + Z)] = f_\mu(x)$,
by definition of $f_\mu$. Now recall the definition of $\|\partial f(x)\| = \sup\{\|g\| \mid g \in \partial f(x)\}$ from the
introduction. To get the upper uniform bound, note first that $f$ is $L_0$-Lipschitz
continuous over $\mathcal{X} + B_\infty(0, u)$, since by assumption

$$\|\partial f(x)\|_\infty \le \mathbb{E}[\|\partial F(x;\xi)\|_\infty] \le \sqrt{\mathbb{E}[\|\partial F(x;\xi)\|_\infty^2]} \le L_0,$$

again using Jensen's inequality. Thus $f$ is $L_0$-Lipschitz with respect to the $\ell_1$-norm, and

$$f_\mu(x) = \mathbb{E}[f(x + Z)] \le \mathbb{E}[f(x)] + L_0\mathbb{E}[\|Z\|_1] = f(x) + \frac{dL_0 u}{2}.$$

To see that the estimate is tight, note that for $f(x) = \|x\|_1$, we have $f_\mu(x) = \sum_{i=1}^d h_u(x_i)$,
where $h_u$ is the shifted Huber loss (36), and $f_\mu(0) = du/2$, while $f(0) = 0$.

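(ii) We now prove that $f_\mu$ is $L_0$-Lipschitz with respect to $\|\cdot\|_1$. Under the stated conditions, we
have $\partial f(x) = \mathbb{E}[\partial F(x;\xi)]$, which shows that $\|\partial f(x)\|_\infty \le \mathbb{E}[\|\partial F(x;\xi)\|_\infty] \le L_0$. Thus, we
obtain the upper bound

$$\|\nabla f_\mu(x)\|_\infty = \|\mathbb{E}[\nabla f(x + Z)]\|_\infty \le \mathbb{E}[\|\nabla f(x + Z)\|_\infty] \le L_0.$$

Tightness follows again by considering $f(x) = \|x\|_1$, where $L_0 = 1$.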

(iii) Recall that differentiability is directly implied by earlier work of Bertsekas [4]. Since $f$ is a.e.-differentiable,
we have $\nabla f_\mu(x) = \mathbb{E}[\nabla f(x + Z)]$ for $Z$ uniform on $[-u,u]^d$. We now establish
Lipschitz continuity of $\nabla f_\mu(x)$.

For a fixed pair $x, y \in \mathcal{X} + B_\infty(0, u)$, we have from Lemma 11

$$\|\mathbb{E}[\nabla f(x + Z)] - \mathbb{E}[\nabla f(y + Z)]\|_\infty \le L_0 \cdot \frac{1}{(2u)^d}\,\lambda\big(B_\infty(x, u)\,\Delta\,B_\infty(y, u)\big),$$

where $\lambda$ denotes Lebesgue measure and $\Delta$ denotes the symmetric set-difference. By a straightforward
geometric calculation, we see that

$$\lambda\big(B_\infty(x,u)\,\Delta\,B_\infty(y,u)\big) = 2\Big((2u)^d - \prod_{i=1}^d [2u - |x_i - y_i|]_+\Big). \tag{37}$$

To control the volume term (37) and complete the proof, we need an auxiliary lemma (which
we prove at the end of this subsection).

Lemma 12. Let $a \in \mathbb{R}^d_+$ and $u \in \mathbb{R}_+$. Then $\prod_{i=1}^d [u - a_i]_+ \ge u^d - \|a\|_1 u^{d-1}$.

The volume (37) is easy to control using Lemma 12. Indeed, we have

$$\lambda\big(B_\infty(x,u)\,\Delta\,B_\infty(y,u)\big) \le 2(2u)^d - 2(2u)^d + 2\|x - y\|_1 (2u)^{d-1},$$

which implies the desired result, that is, that

$$\|\mathbb{E}[\nabla f(x + Z)] - \mathbb{E}[\nabla f(y + Z)]\|_\infty \le \frac{L_0\|x - y\|_1}{u}.$$

To see the tightness claimed in the proposition, consider as usual $f(x) = \|x\|_1$ and let $e_i$
denote the $i$th standard basis vector. Then $L_0 = 1$, $\nabla f_\mu(0) = 0$, $\nabla f_\mu(ue_i) = e_i$, and
$\|\nabla f_\mu(0) - \nabla f_\mu(ue_i)\|_\infty = 1 = \frac{L_0}{u}\|0 - ue_i\|_1$.
(iv) The equality $\mathbb{E}[\nabla F(x + Z;\xi)] = \nabla f_\mu(x)$ follows from Fubini's theorem. The second statement
is simply a consequence of the triangle inequality. Finally, the tightness follows from the
following one-dimensional example. Let $f(x) = L_0|x|$ for $x \in \mathbb{R}$ and $L_0 > 0$. Then $f_\mu(x)$ is
$L_0$ times the Huber loss $h_u(x)$ defined earlier, and $f_\mu'(0) = 0$. Thus for $Z$ uniform on $[-u, u]$,

$$\mathbb{E}(f_\mu'(0) - f'(Z))^2 = \mathbb{E}[L_0^2\operatorname{sign}(Z)^2] = L_0^2,$$

which is the square of the Lipschitz constant of $f$.
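The tightness examples above rest on the fact that uniform smoothing of $|x|$ over $[-u, u]$ produces the shifted Huber loss. The sketch below checks this numerically; the closed form $x^2/(2u) + u/2$ on $[-u, u]$ follows from direct integration of $|x + z|$, and the function names and the midpoint-rule resolution are illustrative assumptions.

```python
def shifted_huber(x, u):
    """Shifted Huber loss h_u(x) = E|x + Z| for Z uniform on [-u, u]."""
    if abs(x) <= u:
        return x * x / (2.0 * u) + u / 2.0
    return abs(x)

def smoothed_abs_numeric(x, u, n=20000):
    """Midpoint-rule approximation of E|x + Z| with Z uniform on [-u, u]."""
    h = 2.0 * u / n
    total = 0.0
    for k in range(n):
        z = -u + (k + 0.5) * h
        total += abs(x + z)
    return total / n

def smoothed_l1(x, u):
    """f_mu(x) for f = ||.||_1 under uniform smoothing on the l_inf ball."""
    return sum(shifted_huber(xi, u) for xi in x)
```

For example, `smoothed_l1([0.0, 0.0, 0.0], 0.5)` evaluates the smoothed $\ell_1$-norm at the origin in dimension $d = 3$, which equals $du/2$ as used in part (i).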

Proof of Lemma 12. We begin by noting that the statement of the lemma trivially holds
whenever $\|a\|_1 \ge u$, as the right hand side of the inequality is then non-positive. Now, fix some
$c < u$, and consider the problem

$$\min_a\ \prod_{i=1}^d (u - a_i)_+ \quad\text{s.t.}\quad a \succeq 0,\ \|a\|_1 \le c. \tag{38}$$

We show that the minimum is achieved when one index is set to $a_i = c$ and the rest to 0. Indeed,
suppose for the sake of contradiction that $a$ is the solution to (38) but that there are indices $i, j$
with $a_i \ge a_j > 0$, that is, at least two non-zero indices. By taking a logarithm, it is clear that
minimizing the objective (38) is equivalent to minimizing $\sum_{i=1}^d \log(u - a_i)$. Taking the derivative
of $\log(u - a_i)$ for $i$ and $j$, we see that

$$\frac{\partial}{\partial a_i}\log(u - a_i) = \frac{-1}{u - a_i} \le \frac{-1}{u - a_j} = \frac{\partial}{\partial a_j}\log(u - a_j).$$

Since $\frac{-1}{u - a}$ is a decreasing function of $a$, increasing $a_i$ slightly and decreasing $a_j$ slightly causes
$\log(u - a_i)$ to decrease faster than $\log(u - a_j)$ increases, thus decreasing the overall objective.
This is the desired contradiction. □
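As a sanity check on Lemma 12, the following sketch compares $\prod_i [u - a_i]_+$ with $u^d - \|a\|_1 u^{d-1}$ on random non-negative vectors. The helper names, the sampling range, and the tolerance are assumptions made only for this illustration.

```python
import random

def positive_part_product(a, u):
    """Left-hand side of Lemma 12: product of the positive parts [u - a_i]_+."""
    prod = 1.0
    for ai in a:
        prod *= max(u - ai, 0.0)
    return prod

def lemma12_lower_bound(a, u):
    """Right-hand side of Lemma 12 for a non-negative vector a."""
    d = len(a)
    return u**d - sum(a) * u**(d - 1)

def check_lemma12(trials=2000, d=5, u=1.0, seed=0):
    """Return True if the inequality holds on all sampled vectors."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.uniform(0.0, 0.5 * u) for _ in range(d)]
        if positive_part_product(a, u) < lemma12_lower_bound(a, u) - 1e-12:
            return False
    return True
```

The bound holds with equality when a single coordinate carries all of the mass, which matches the extremal configuration identified in the proof.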



E.2.2 Proof of Lemma 7

The proof of this lemma is nearly identical to the proof of Lemma 6, though we replace the norms
used there with $\|\cdot\|_2$. We prove each of the statements in turn, and throughout let $Z$ denote a variable
distributed uniformly on $B_\infty(0,u)$.

(i) Jensen's inequality implies that $f(x) = f(x + \mathbb{E}[Z]) \le \mathbb{E}[f(x + Z)] = f_\mu(x)$. For the upper
bound on $f_\mu$, use the Lipschitz continuity of $f$ and Jensen's inequality to see that

$$f_\mu(x) \le f(x) + L_0\mathbb{E}[\|Z\|_2] \le f(x) + L_0\sqrt{\mathbb{E}[\|Z\|_2^2]} = f(x) + \frac{L_0 u\sqrt{d}}{\sqrt{3}}.$$

(ii) As earlier, since $\mathbb{E}[\nabla f(x + Z)] = \nabla f_\mu(x)$, we have $\|\mathbb{E}[\nabla f(x + Z)]\|_2 \le \mathbb{E}[\|\nabla f(x + Z)\|_2] \le L_0$.

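(iii) Using the same sequence of steps as in the proof of part (iii) in Lemma 6, we see that

$$\|\nabla f_\mu(x) - \nabla f_\mu(y)\|_2 \le \frac{L_0}{(2u)^d}\,\lambda\big(B_\infty(x,u)\,\Delta\,B_\infty(y,u)\big) \le \frac{2L_0}{(2u)^d}(2u)^{d-1}\|x - y\|_1 \le \frac{\sqrt{d}L_0}{u}\|x - y\|_2.$$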
(iv) As in the proof of Lemma 6, Fubini's theorem implies the first part of the statement, while
the second part is a consequence of the fact that

$$\mathbb{E}[\|\nabla f_\mu(x) - \nabla F(x + Z;\xi)\|_2^2] = \mathbb{E}[\|\nabla F(x + Z;\xi)\|_2^2] - \|\nabla f_\mu(x)\|_2^2 \le L_0^2$$

by the assumptions on $F$. Tightness follows from considering the one-dimensional function
$f(x) = |x|$ as earlier.

E.2.3 Proof of Lemma 9

Throughout this proof, we use $Z$ to denote a random variable distributed as $N(0, u^2 I)$.

(i) As in the earlier lemmas, Jensen's inequality gives $f(x) = f(x + \mathbb{E}Z) \le \mathbb{E}f(x + Z) = f_\mu(x)$.
Our assumption on $\partial F(\cdot;\xi)$ implies that $f$ is $L_0$-Lipschitz, so

$$f_\mu(x) = \mathbb{E}[f(x + Z)] \le \mathbb{E}[f(x)] + L_0\mathbb{E}[\|Z\|_2] \le f(x) + L_0\sqrt{\mathbb{E}[\|Z\|_2^2]} = f(x) + L_0 u\sqrt{d}.$$

(ii) This proof is analogous to that of part (ii) of Lemmas 6 and 7. The tightness of the Lipschitz
constant can be verified by taking $f(x) = \langle v, x\rangle$ for $v \in \mathbb{R}^d$, in which case $f_\mu(x) = f(x)$, and
both have gradient $v$.

(iii) Now we show that $\nabla f_\mu$ is Lipschitz continuous. Indeed, applying Lemma 11 we have

$$\|\nabla f_\mu(x) - \nabla f_\mu(y)\|_2 \le L_0\int|\mu(z - x) - \mu(z - y)|\,dz. \tag{39}$$

What remains is to control the integral term (39), denoted $I_2$.

In order to do so, we follow a technique used by Lakshmanan and Pucci de Farias [17]. Since
$\mu$ satisfies $\mu(z - x) \ge \mu(z - y)$ if and only if $\|z - x\|_2 \le \|z - y\|_2$, we have

$$I_2 = \int|\mu(z - x) - \mu(z - y)|\,dz = 2\int_{z : \|z - x\|_2 \le \|z - y\|_2}\big(\mu(z - x) - \mu(z - y)\big)\,dz.$$

By making the change of variable $w = z - x$ for the $\mu(z - x)$ term in $I_2$ and $w = z - y$ for
$\mu(z - y)$, we rewrite $I_2$ as

$$I_2 = 2\int_{w : \|w\|_2 \le \|w - (x - y)\|_2}\mu(w)\,dw - 2\int_{w : \|w\|_2 \ge \|w - (x - y)\|_2}\mu(w)\,dw
= 2\mathbb{P}_\mu\big(\|Z\|_2 \le \|Z - (x - y)\|_2\big) - 2\mathbb{P}_\mu\big(\|Z\|_2 \ge \|Z - (x - y)\|_2\big),$$

where $\mathbb{P}_\mu$ denotes probability according to the density $\mu$. Squaring the terms inside the
probability bounds, we note that

$$\mathbb{P}_\mu\big(\|Z\|_2 \le \|Z - (x - y)\|_2\big) = \mathbb{P}_\mu\big(2\langle Z, x - y\rangle \le \|x - y\|_2^2\big).$$

Since $(x - y)/\|x - y\|_2$ has norm 1 and $Z \sim N(0, u^2 I)$ is rotationally invariant, the random
variable $W = \big\langle Z, \frac{x - y}{\|x - y\|_2}\big\rangle$ has distribution $N(0, u^2)$. Consequently, we have

$$\frac{I_2}{2} = \mathbb{P}(W \le \|x - y\|_2/2) - \mathbb{P}(W \ge \|x - y\|_2/2) = \frac{1}{\sqrt{2\pi u^2}}\int_{-\|x - y\|_2/2}^{\|x - y\|_2/2}\exp(-w^2/(2u^2))\,dw \le \frac{1}{u\sqrt{2\pi}}\|x - y\|_2,$$

where we have exploited symmetry and the inequality $\exp(-w^2) \le 1$. Combining this bound
with the earlier inequality (39), we have

$$\|\nabla f_\mu(x) - \nabla f_\mu(y)\|_2 \le \frac{2L_0}{u\sqrt{2\pi}}\|x - y\|_2 \le \frac{L_0}{u}\|x - y\|_2.$$

(iv) The proof of the variance bound is completely identical to that for Lemma 7.

That each of the bounds above is tight is a consequence of Lemma 10.
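For the Gaussian case, the integral in (39) reduces to a one-dimensional normal probability and can be evaluated in closed form with the error function. The sketch below compares that value with the linear bound used in part (iii); the function names are hypothetical, and `r` stands for $\|x - y\|_2$.

```python
import math

def gaussian_shift_l1(r, u):
    """Integral of |mu(z - x) - mu(z - y)| for mu = N(0, u^2 I), r = ||x - y||_2.

    Equals 2 [P(W <= r/2) - P(W >= r/2)] with W ~ N(0, u^2),
    which simplifies to 2 erf(r / (2 sqrt(2) u)).
    """
    return 2.0 * math.erf(r / (2.0 * math.sqrt(2.0) * u))

def linear_upper_bound(r, u):
    """Linear bound 2 r / (u sqrt(2 pi)) obtained from exp(-w^2) <= 1."""
    return 2.0 * r / (u * math.sqrt(2.0 * math.pi))
```

The two quantities agree to first order as $r \to 0$, which is why the $L_0/u$ Lipschitz constant cannot be improved by more than a constant factor for small displacements.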

E.2.4 Proof of Lemma 10

Throughout this proof, $c$ will denote a dimension independent constant and may change from line
to line and inequality to inequality. We will show the result holds by considering a convex combination
of "difficult" functions, in this case $f_1(x) = L_0\|x\|_2$ and $f_2(x) = L_0|\langle x, y/\|y\|_2\rangle - 1/2|$, and
choosing $f = \frac{1}{2}f_1 + \frac{1}{2}f_2$. Our first step in the proof will be to control $E_U$.

By definition of the constant $E_U$ in Eq. (33), for any convex $f_1$ and $f_2$ we have
$E_U(\frac{1}{2}f_1 + \frac{1}{2}f_2) \ge \frac{1}{2}\max\{E_U(f_1), E_U(f_2)\}$. Thus for $Z \sim N(0, u^2 I_{d\times d})$ we have $\mathbb{E}[f_1(Z)] \ge cL_0 u\sqrt{d}$, i.e.
$E_U(f) \ge cL_0 u\sqrt{d}$, and for $Z$ uniform on $B_2(0,u)$, we have $\mathbb{E}[f_1(Z)] \ge cL_0 u$, i.e. $E_U(f) \ge cL_0 u$.

Turning to control of $E_\nabla$, we note that for any random variable $Z$ rotationally symmetric about
the origin, symmetry implies that

$$\mathbb{E}[\nabla f_1(Z + y)] = L_0\mathbb{E}\Big[\frac{Z + y}{\|Z + y\|_2}\Big] = a_z y,$$

where $a_z > 0$ is a constant dependent on $Z$. Thus we have

$$\mathbb{E}[\nabla f_1(Z)] - \mathbb{E}[\nabla f_1(Z + y)] + \mathbb{E}[\nabla f_2(Z)] - \mathbb{E}[\nabla f_2(Z + y)] = 0 - a_z y - L_0\frac{y}{\|y\|_2}\int|\mu(z) - \mu(z - y)|\,dz$$

from Lemma 11. As a consequence (since $a_z y$ is parallel to $y/\|y\|_2$), we see that

$$E_\nabla\big(\tfrac{1}{2}f_1 + \tfrac{1}{2}f_2\big) \ge \frac{L_0}{2\|y\|_2}\int|\mu(z) - \mu(z - y)|\,dz.$$

So what remains is to lower bound $\int|\mu(z) - \mu(z - y)|\,dz$ for the uniform and normal distributions.
As we saw in the proof of Lemma 9, for the normal distribution

$$\int|\mu(z) - \mu(z - y)|\,dz = \frac{2}{u\sqrt{2\pi}}\int_{-\|y\|_2/2}^{\|y\|_2/2}\exp(-w^2/(2u^2))\,dw = \frac{2}{u\sqrt{2\pi}}\|y\|_2 + O\Big(\frac{\|y\|_2^2}{u^2}\Big).$$

By taking small enough $\|y\|_2$, we achieve the inequality $E_\nabla(\frac{1}{2}f_1 + \frac{1}{2}f_2) \ge cL_0/u$ when $Z \sim N(0, u^2 I_{d\times d})$.

To show that the bound in the lemma is sharp for the case of the uniform distribution on
$B_2(0, u)$, we slightly modify the proof of Lemma 2 in [38]. In particular, by using a Taylor expansion
instead of first-order convexity in inequality (11) of [38], it is not difficult to show that



/ 



IM^) - - y)\dz = + ll^ll' 



where $\kappa = 2/\pi$ if $d$ is even and 1 otherwise. Since $d!!/(d - 1)!! = \Theta(\sqrt{d})$, we have proved that for
small enough $\|y\|_2$, there is a constant $c$ such that $\int|\mu(z) - \mu(z - y)|\,dz \ge c\sqrt{d}\|y\|_2/u$.

E.2.5 Proof of Lemma 11

Without loss of generality, we assume that $x = 0$ (a linear change of variables allows this). Let
$g : \mathbb{R}^d \to \mathbb{R}^d$ be a vector-valued function such that $\|g(z)\|_* \le L_0$ for all $z \in \{y\} + \operatorname{supp}\mu$. Then

$$\mathbb{E}[g(Z) - g(y + Z)] = \int g(z)\mu(z)\,dz - \int g(y + z)\mu(z)\,dz
= \int g(z)\mu(z)\,dz - \int g(z)\mu(z - y)\,dz
= \int_{I_>} g(z)[\mu(z) - \mu(z - y)]\,dz - \int_{I_<} g(z)[\mu(z - y) - \mu(z)]\,dz, \tag{40}$$

where $I_> = \{z \in \mathbb{R}^d \mid \mu(z) > \mu(z - y)\}$ and $I_< = \{z \in \mathbb{R}^d \mid \mu(z) < \mu(z - y)\}$. It is now clear that
when we take norms we have

$$\|\mathbb{E}[g(Z) - g(y + Z)]\|_* \le \sup_{z \in I_> \cup I_<}\|g(z)\|_*\,\Big|\int_{I_>}[\mu(z) - \mu(z - y)]\,dz + \int_{I_<}[\mu(z - y) - \mu(z)]\,dz\Big|
\le L_0\Big(\int_{I_>}[\mu(z) - \mu(z - y)]\,dz + \int_{I_<}[\mu(z - y) - \mu(z)]\,dz\Big)
= L_0\int|\mu(z) - \mu(z - y)|\,dz.$$

Taking $g(z)$ to be an arbitrary element of $\partial f(z)$ completes the proof of the bound (35).

To see that the result is tight when $\mu$ is rotationally symmetric and the norm $\|\cdot\| = \|\cdot\|_2$, we
note the following. From the equality (40), we see that $\|\mathbb{E}[g(Z) - g(y + Z)]\|_2$ is maximized by
choosing $g(z) = v$ for $z \in I_>$ and $g(z) = -v$ for $z \in I_<$ for any $v$ such that $\|v\|_2 = L_0$. Since $\mu$ is
rotationally symmetric and non-increasing in $\|z\|_2$,

$$I_> = \{z \in \mathbb{R}^d \mid \mu(z) > \mu(z - y)\} = \{z \in \mathbb{R}^d \mid \|z\|_2^2 < \|z - y\|_2^2\} = \big\{z \in \mathbb{R}^d \mid \langle z, y\rangle < \tfrac{1}{2}\|y\|_2^2\big\},$$

$$I_< = \{z \in \mathbb{R}^d \mid \mu(z) < \mu(z - y)\} = \{z \in \mathbb{R}^d \mid \|z\|_2^2 > \|z - y\|_2^2\} = \big\{z \in \mathbb{R}^d \mid \langle z, y\rangle > \tfrac{1}{2}\|y\|_2^2\big\}.$$

So all we need do is find a function $f$ for which there exists $v$ with $\|v\|_2 = L_0$, and such that
$\partial f(x) = \{v\}$ for $x \in I_>$ and $\partial f(x) = \{-v\}$ for $x \in I_<$. By inspection, the function $f$ defined in the
F Sub-Gaussian and sub-exponential tail bounds

For reference purposes, we state here some standard definitions and facts about sub-Gaussian and
sub-exponential random variables (see the books [7, 21, 35] for further details).

F.1 Sub-Gaussian variables

This class of random variables is characterized by a quadratic upper bound on the moment generating
function:

Definition F.1. A zero-mean random variable $X$ is called sub-Gaussian with parameter $\sigma^2$ if
$\mathbb{E}\exp(\lambda X) \le \exp(\sigma^2\lambda^2/2)$ for all $\lambda \in \mathbb{R}$.

Remarks: If $X_i$, $i = 1, \ldots, n$ are independent sub-Gaussian with parameter $\sigma^2$, it follows from
this definition that $\frac{1}{n}\sum_{i=1}^n X_i$ is sub-Gaussian with parameter $\sigma^2/n$. Moreover, it is well-known
that any zero-mean random variable $X$ satisfying $|X| \le C$ is sub-Gaussian with parameter $\sigma^2 \le C^2$.
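Both remarks can be checked on the simplest bounded example, a symmetric two-point variable on $\{-c, +c\}$, whose moment generating function is $\cosh(\lambda c)$. The sketch below is illustrative only; the function names are hypothetical.

```python
import math

def rademacher_mgf(lam, c=1.0):
    """E[exp(lam * X)] for X uniform on {-c, +c}."""
    return math.cosh(lam * c)

def mean_mgf(lam, n, c=1.0):
    """E[exp(lam * mean)] for the average of n independent copies of X."""
    return math.cosh(lam * c / n) ** n

def sub_gaussian_bound(lam, sigma_sq):
    """Right-hand side of Definition F.1: exp(sigma^2 lam^2 / 2)."""
    return math.exp(sigma_sq * lam * lam / 2.0)
```

Here the single variable satisfies the definition with parameter $c^2$, and the average of $n$ copies satisfies it with parameter $c^2/n$, in line with the remarks.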



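Lemma 13 (Buldygin and Kozachenko [7], Lemma 1.6). Let $X - \mathbb{E}X$ be sub-Gaussian with parameter
$\sigma^2$. Then for $s \in [0, 1]$,

$$\mathbb{E}\exp\Big(\frac{sX^2}{2\sigma^2}\Big) \le \frac{1}{\sqrt{1 - s}}\exp\Big(\frac{(\mathbb{E}X)^2}{2\sigma^2}\cdot\frac{s}{1 - s}\Big).$$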

The maximum of $d$ sub-Gaussian random variables grows logarithmically in $d$, as shown by the
following result:

Lemma 14. Let $X \in \mathbb{R}^d$ be a random vector with sub-Gaussian components, each with parameter
at most $\sigma^2$. Then $\mathbb{E}\|X\|_\infty^2 \le \max\{6\sigma^2\log d, 2\sigma^2\}$.

Using the definition of sub-Gaussianity, the result can be proved by a combination of union bounds
and Chernoff's inequality (see van der Vaart and Wellner [35, Lemma 2.2.2] or Buldygin and
Kozachenko [7, Chapter II] for details).

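The following martingale-based bound for variables with conditionally sub-Gaussian behavior is
essentially standard [2, 13, 7].

Lemma 15 (Azuma-Hoeffding). Let $X_i$ be a martingale difference sequence adapted to the filtration
$\mathcal{F}_i$, and assume that each $X_i$ is conditionally sub-Gaussian with parameter $\sigma_i^2$, meaning that
$\mathbb{E}[\exp(\lambda X_i) \mid \mathcal{F}_{i-1}] \le \exp(\lambda^2\sigma_i^2/2)$. Then for all $\epsilon > 0$,

$$\mathbb{P}\Big[\frac{1}{n}\sum_{i=1}^n X_i \ge \epsilon\Big] \le \exp\Big(-\frac{n^2\epsilon^2}{2\sum_{i=1}^n\sigma_i^2}\Big). \tag{41}$$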

The next lemma uses martingale techniques to establish the sub-Gaussianity of a normed sum:

Lemma 16. Let $X_1, \ldots, X_n$ be independent random vectors with $\|X_i\| \le L$ for all $i$. Define
$S_n = \sum_{i=1}^n X_i$. Then $\|S_n\| - \mathbb{E}\|S_n\|$ is sub-Gaussian with parameter at most $4nL^2$.

Proof. The proof follows from the realization that when $\|X_i\| \le L$, the quantity $\|S_n\| - \mathbb{E}\|S_n\|$
can be controlled using single-dimensional martingale techniques [21, Chapter 6]. We construct
the Doob martingale for the sequence $X_i$. Let $\mathcal{F}_i$ be the $\sigma$-field of $X_1, \ldots, X_i$ and define the real-valued
random variables $Z_i = \mathbb{E}[\|S_n\| \mid \mathcal{F}_i] - \mathbb{E}[\|S_n\| \mid \mathcal{F}_{i-1}]$, where $\mathcal{F}_0$ is the trivial $\sigma$-field. Let
$S_{n\setminus i} = \sum_{j \ne i} X_j$. Then $\mathbb{E}[Z_i \mid \mathcal{F}_{i-1}] = 0$ and

$$|Z_i| = \big|\mathbb{E}[\|S_n\| \mid \mathcal{F}_{i-1}] - \mathbb{E}[\|S_n\| \mid \mathcal{F}_i]\big|
\le \big|\mathbb{E}[\|S_{n\setminus i}\| \mid \mathcal{F}_{i-1}] - \mathbb{E}[\|S_{n\setminus i}\| \mid \mathcal{F}_i]\big| + \mathbb{E}[\|X_i\| \mid \mathcal{F}_{i-1}] + \mathbb{E}[\|X_i\| \mid \mathcal{F}_i]
= \|X_i\| + \mathbb{E}[\|X_i\|] \le 2L,$$

since $X_j$ is independent of $\mathcal{F}_{i-1}$ for $j \ge i$. Thus $Z_i$ defines a bounded martingale difference sequence,
and $\sum_{i=1}^n Z_i = \|S_n\| - \mathbb{E}\|S_n\|$. Since $|Z_i| \le 2L$, the $Z_i$ are conditionally sub-Gaussian
with parameter at most $4L^2$. Thus $\sum_{i=1}^n Z_i$ is sub-Gaussian with parameter at most $4nL^2$. □



F.2 Sub-exponential random variables

A slightly less restrictive tail condition defines the class of sub-exponential random variables:

Definition F.2. A zero-mean random variable $X$ is sub-exponential with parameters $(\Lambda, \tau)$ if

$$\mathbb{E}[\exp(\lambda X)] \le \exp\Big(\frac{\lambda^2\tau^2}{2}\Big) \quad\text{for all } |\lambda| \le \Lambda.$$

The following lemma provides an equivalent characterization of sub-exponential variables via a tail
bound:

Lemma 17. Let $X$ be a zero-mean random variable. If there are constants $a, \alpha > 0$ such that

$$\mathbb{P}(|X| \ge t) \le a\exp(-\alpha t) \quad\text{for all } t > 0,$$

then $X$ is sub-exponential with parameters $\Lambda = \alpha/2$ and $\tau^2 = 4a/\alpha^2$.

The proof of the lemma follows from a Taylor expansion of $\exp(\cdot)$ and the identity $\mathbb{E}[|X|^k] =
\int_0^\infty \mathbb{P}(|X|^k \ge t)\,dt$ (for similar results, see Buldygin and Kozachenko [7, Chapter 1.3]).
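A quick numerical illustration of Lemma 17 uses the standard Laplace distribution, for which $\mathbb{P}(|X| \ge t) = \exp(-t)$, so $a = 1$ and $\alpha = 1$. The sketch assumes the moment generating function bound takes the form $\exp(\lambda^2\tau^2/2)$ for $|\lambda| \le \Lambda$; the function names are hypothetical.

```python
import math

def lemma17_parameters(a, alpha):
    """Sub-exponential parameters (Lambda, tau^2) given P(|X| >= t) <= a exp(-alpha t)."""
    return alpha / 2.0, 4.0 * a / alpha**2

def laplace_mgf(lam):
    """E[exp(lam X)] for the standard Laplace density exp(-|x|)/2, valid for |lam| < 1."""
    return 1.0 / (1.0 - lam * lam)
```

For the Laplace example the parameters are $\Lambda = 1/2$ and $\tau^2 = 4$, and the exact moment generating function $1/(1 - \lambda^2)$ stays below $\exp(2\lambda^2)$ on $|\lambda| \le 1/2$.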

Lastly, any random variable whose square is sub-exponential is sub-Gaussian, as shown by the
following result:

Lemma 18 (Lan, Nemirovski, Shapiro [20], Lemma 6). Let $X$ be a zero-mean random variable
satisfying the moment generating inequality $\mathbb{E}[\exp(X^2/\sigma^2)] \le \exp(1)$. Then $X$ is sub-Gaussian with
parameter at most $\frac{3}{2}\sigma^2$.